
"Advanced Graphics and Data Visualization in R" is brought to you by the Centre for the Analysis of Genome Evolution & Function's (CAGEF) bioinformatics training initiative. This CSB1021 was developed to enhance the skills of students with basic backgrounds in R by focusing on available philosophies, methods, and packages for plotting scientific data. While the datasets and examples used in this course will be centred on SARS-CoV-2 epidemiological and genomic data, the lessons learned herein will be broadly applicable.
This lesson is the third in a 6-part series. The aim for the end of this series is for students to recognize how to import, format, and display data based on their intended message and audience. The format and style of these visualizations will help to identify and convey the key message(s) from their experimental data.
The structure of the class is a code-along style in Jupyter notebooks. At the start of each lecture, skeleton versions of the lecture will be provided for use on the University of Toronto Jupyter Hub so students can program along with the instructor.
Last week we did a deep dive on some of the more popular and broadly applicable visualizations for conveying basic ideas about your data. This week will focus on tidying up your visualizations and adding those extra finishing touches that will help polish them off: adding, removing, and altering graph elements. Getting these little details correct helps you avoid making alterations with additional software outside of R.
At the end of this lecture you will have covered the following topics
grey background - a package, function, code, command or directory. Backticks are also used for in-line code.
italics - an important term or concept or an individual file or folder
bold - heading or a term that is being defined
blue text - named or unnamed hyperlink
... - Within each coding cell this will indicate an area of code that students will need to complete for the code cell to run correctly.
Today's lecture will draw on a number of datasets we've used in our previous lectures.
This data file contains 4 objects:
covid_phu_long.df: COVID-19 daily case values across Ontario public health units (PHUs), seen in lecture 01.
covid_phu_window.df: sliding window data generated from covid_phu_long.df based on a 14-day rolling mean.
phu_by_total_cases_desc: a list of Ontario PHUs in descending order by caseload.
covid_demographics_total.df: age group demographics in a long format that we generated in lecture 02.
repr - a package useful for altering some of the attributes of objects related to the R kernel.
tidyverse which has a number of packages including dplyr, tidyr, stringr, forcats and ggplot2
viridis helps to create color-blind palettes for our data visualizations
lubridate and zoo are helper packages used for working with date formats in R
ggthemes, directlabels, ggforce, ggbeeswarm, gghighlight, and ggExtra will provide us new geoms and methods for plotting or altering how our plots look.
ggpubr for arranging our plots.
# None of these packages are already available on JupyterHub
# install.packages("directlabels")
# install.packages("ggbeeswarm", dependencies = TRUE)
# install.packages("gghighlight")
# install.packages("ggExtra")
# install.packages("ggpubr")
# install.packages("ggtext")
# Packages to help tidy our data
library(tidyverse)
# Packages for the graphical analysis section
library(repr)
library(viridis)
# New visualisation packages
library(ggthemes)
library(directlabels)
library(ggforce)
library(ggbeeswarm)
library(gghighlight)
library(ggExtra)
library(ggpubr)
library(ggtext)
# Packages used for working with/formatting dates in R
library(lubridate)
library(zoo)
-- Attaching core tidyverse packages ------------------------ tidyverse 2.0.0 --
v dplyr     1.1.0     v readr     2.1.4
v forcats   1.0.0     v stringr   1.5.0
v ggplot2   3.4.1     v tibble    3.2.1
v lubridate 1.9.2     v tidyr     1.3.0
v purrr     1.0.1
-- Conflicts ------------------------------------------ tidyverse_conflicts() --
x dplyr::filter() masks stats::filter()
x dplyr::lag()    masks stats::lag()
i Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
Loading required package: viridisLite
Attaching package: 'zoo'
The following objects are masked from 'package:base':
    as.Date, as.Date.numeric
Last week in lecture 2 we spent our time highlighting various types of plots and their variants while discerning the proper circumstances of their use. Now that we know which plots to use and when to use them, we can focus on how to clean up your visualizations so each can be presented as its "best self".
Through both lectures and assignments we have already glimpsed at some of the commands and layers we can use to improve upon our graphs whether that is by choosing colour, titles, or legend information. Today we'll explore those options more deeply so you don't have to spend days trying to get your visualizations to look perfect. We'll revisit some old plots and build them up from basics and tweak them to produce this:
[Figure: By the time we finish today, we'll know how to manipulate many of the elements of a ggplot.]
Let's start with our PHU caseload data from lecture 1. We'll load it from a .RData file along with some other helpful objects.
# Load some pregenerated data tables for class
# Load Lecture03.RData
load("data/Lecture03.RData")
ls()
# Remind ourselves what covid_phu_window.df looks like
head(covid_phu_window.df)
| public_health_unit | window_mean | start_date | end_date |
|---|---|---|---|
| <fct> | <dbl> | <date> | <date> |
| Algoma | 0 | 2020-01-23 | 2020-02-05 |
| Brant | 0 | 2020-01-23 | 2020-02-05 |
| Chatham Kent | 0 | 2020-01-23 | 2020-02-05 |
| Durham | 0 | 2020-01-23 | 2020-02-05 |
| Eastern Ontario | 0 | 2020-01-23 | 2020-02-05 |
| Grey Bruce | 0 | 2020-01-23 | 2020-02-05 |
# This is going to be a simpler graph so adjust our plot window size accordingly
options(repr.plot.width=20, repr.plot.height=10)
covid_phu_window.df %>%
# Filter for the top 5 infected PHUs
filter(public_health_unit %in% phu_by_total_cases_desc[1:5],
start_date >= as.Date("2020-12-01")) %>%
# redirect the filtered result to ggplot
# 1. Data
ggplot(.) +
# 2. Aesthetics
aes(x = start_date, y = window_mean, # Set our x and y axes
colour = fct_reorder(public_health_unit, window_mean, .desc=TRUE)) + # Reorder our PHUs
# 4. Geoms
geom_line(linewidth=1)
From our above plot, we can immediately see that we have issues that need remedying:
aes() assignment used.
theme()
Although we haven't directly discussed themes yet, we have seen them appear here and there in our individual plots. Themes set and control the presentation of titles, labels, text, background, legends, etc. They don't change the actual information presented in these elements.
Calls to theme() generally take the form of theme(element.component.sub-component = element_*(parameter = value))
Some basic elements include line, rect, text, title, and aspect.ratio. Altering these elements in theme() will alter all elements of their kind (ie all lines, rectangles, text etc.). Alternatively specific element components can be altered more directly. The following table lists most of the possible theme elements and components. They can be as specific as axis.title.x.top. More detailed descriptions can be found here.
| Element | Description | Components | Sub-components | Other |
|---|---|---|---|---|
| axis | x and y axis elements | title, text, ticks, line | x, y, length | top, bottom, left, right |
| legend | all legend elements | background, margin, spacing, key, text, title, position, direction, justification, box | x, y, size, height, width, align, just, spacing | |
| panel | background plotting area | background, border, spacing, grid | x, y, major, minor | |
| plot | entire plot | background, title, subtitle, caption, tag, margin | position | |
| strip | facet labels | background, placement, text, switch | x, y, text, pad | grid, wrap |
You update or set your individual elements using the element_*() functions. Within each element you can typically control aesthetics like fill, colour/color, size, etc. Below is a summary of the elements of concern and their parameters. Specific element_*() functions will correspond with the above theme elements.
| element call | description | fill | colour | size | linetype | lineend | arrow | family | face | hjust | vjust | angle | lineheight | margin |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| element_line() | formatting of lines | | $\checkmark$ | $\checkmark$ | $\checkmark$ | $\checkmark$ | $\checkmark$ | | | | | | | |
| element_text() | formatting of text | | $\checkmark$ | $\checkmark$ | | | | $\checkmark$ | $\checkmark$ | $\checkmark$ | $\checkmark$ | $\checkmark$ | $\checkmark$ | $\checkmark$ |
| element_rect() | borders and background | $\checkmark$ | $\checkmark$ | $\checkmark$ | $\checkmark$ | | | | | | | | | |
| element_blank() | draws nothing, and assigns no space | | | | | | | | | | | | | |
inherit.blank is an additional parameter you can use in these functions that is normally set to FALSE. When set to TRUE, an element will also be drawn blank if its parent element uses element_blank(). For example, axis.title is the parent of axis.title.x. By setting inherit.blank = TRUE, you can override/nullify an element's aesthetic assignments as long as a parent element has been set to element_blank(). It's a good way to remove additional layer effects if needed!
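As a minimal sketch of the call form and the parent/child inheritance described above, the example below layers a small theme on top of ggplot2's default theme_grey() and asks calc_element() what a child element ends up with. The toy.df data frame, the object names, and the colours are made up for illustration and are not part of the lecture's datasets.

```r
library(ggplot2)

# Hypothetical toy data, not one of the lecture's objects
toy.df <- data.frame(day = 1:10,
                     cases = c(3, 5, 8, 13, 12, 9, 7, 6, 4, 2))

# theme(element.component.sub-component = element_*(parameter = value))
toy_theme <- theme(
  text             = element_text(size = 20),           # parent of all text elements
  axis.title.x     = element_text(colour = "darkred"),  # only the x-axis title
  axis.line        = element_line(colour = "black"),    # both axis lines
  panel.background = element_rect(fill = "white"),      # the data panel
  panel.grid.minor = element_blank()                    # draw nothing, reserve no space
)

# Resolve what the x-axis title looks like once parents and children are combined
resolved_title <- calc_element("axis.title.x", theme_grey() + toy_theme)
resolved_title$size     # inherited from the text element
resolved_title$colour   # set directly on axis.title.x

# Apply the same theme to a plot
toy.plot <- ggplot(toy.df) +
  aes(x = day, y = cases) +
  theme_grey() +
  toy_theme +
  geom_line()
toy.plot
```

The x-axis title never had its size set directly, yet it picks up the value given to text, which is the same mechanism that lets a single theme(text = ...) call resize every label on a plot.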
legend.position option
Let's start with one of the most oft-intrusive components of our visualizations. While necessary, legends often default to the right-hand side of our visualizations, where they can take up extra horizontal space without requiring much vertical space!
When we are looking to move our legends to different positions, there are 2 areas to consider. The first is the plot area itself which surrounds the data panel (where our data is plotted). The legend.position parameter can take in two types of values. The first is a set of characters: top, bottom, left, and right which relates to the plot area.
Let's start with altering our legend position within the plot area. It's taking up quite a bit of space on the side. We'll worry about the label issues later. For now, let's move the legend to the bottom of the plot. At the same time, let's increase our overall text size for the plot.
# This is going to be a simpler graph so adjust our plot window size accordingly
options(repr.plot.width=20, repr.plot.height=10)
# Build our plot and data from scratch
covid_phu_window.df %>%
# Filter for the top 4 infected PHUs
filter(public_health_unit %in% phu_by_total_cases_desc[1:4],
start_date >= as.Date("2020-12-01")) %>%
# redirect the filtered result to ggplot
# 1. Data
ggplot(.) +
# 2. Aesthetics
aes(x = start_date, y = window_mean, # Set our x and y axes
colour = fct_reorder(public_health_unit, window_mean, .desc=TRUE)) + # Reorder our PHUs
# Theme elements
theme(text = element_text(size=20), # set text size to 20
### 1.1.1 Move the legend to the bottom
legend.position = "bottom"
) +
# 4. Geoms
geom_line(linewidth=1)
Instead of moving the legend to the bottom of our plot area, let's use the empty space in the top-left corner of the data panel by accessing the coordinate system ([0:1], [0:1]) that represents the relative positioning of elements within the panel. This system follows a c(x, y) setup that matches the data panel, with (0,0) representing the lower-left corner.
Before we move the legend onto our panel, however, we also have to remember where the legend itself is anchoring when we move it. Are we asking to put the bottom-right corner of the legend into the top-left corner of the plot? Or do we want to match the legend anchor so that the top-left corners are aligned?
Use the legend.justification parameter to properly set this property when moving your legend. It uses the same two-point coordinate concept that we'll use for legend.position.
# This is going to be a simpler graph so adjust our plot window size accordingly
options(repr.plot.width=20, repr.plot.height=10)
# Build our plot and data from scratch
covid_phu_window.df %>%
# Filter for the top 4 infected PHUs
filter(public_health_unit %in% phu_by_total_cases_desc[1:4],
start_date >= as.Date("2020-12-01")) %>%
# redirect the filtered result to ggplot
# 1. Data
ggplot(.) +
# 2. Aesthetics
aes(x = start_date, y = window_mean, # Set our x and y axes
colour = fct_reorder(public_health_unit, window_mean, .desc=TRUE)) + # Reorder our PHUs
# Theme elements
theme(text = element_text(size=20), # set text size to 20
### 1.1.1.2 Move the legend around to within the panel space
legend.justification = c(0,1),
legend.position = c(0.02, 0.95),
legend.direction = "horizontal"
) +
# 4. Geoms
geom_line(linewidth=1)
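The numeric c(x, y) form of legend.position used above matches the ggplot2 3.4.1 release loaded in this lecture. From ggplot2 3.5.0 onwards that form is deprecated in favour of legend.position = "inside" together with legend.position.inside. The sketch below is one hedged way to write code that works on either side of that change; inside_legend_theme() is a hypothetical helper name, not a ggplot2 function.

```r
library(ggplot2)

# Hypothetical helper: place the legend inside the data panel at relative
# coordinates (x, y). The justification sets which point of the legend box
# is pinned to that position.
inside_legend_theme <- function(x = 0.02, y = 0.95, justification = c(0, 1)) {
  if (utils::packageVersion("ggplot2") >= "3.5.0") {
    # Newer releases: request an inside legend, then give its coordinates
    theme(legend.position        = "inside",
          legend.position.inside = c(x, y),
          legend.justification   = justification)
  } else {
    # Releases such as 3.4.1: coordinates go directly to legend.position
    theme(legend.position      = c(x, y),
          legend.justification = justification)
  }
}

# Top-left corner of the legend pinned near the top-left of the panel
top_left <- inside_legend_theme()

# Bottom-right corner of the legend pinned near the bottom-right of the panel
bottom_right <- inside_legend_theme(0.98, 0.05, justification = c(1, 0))
```

Either object can be added to a plot like any other theme() call, for example some_plot + top_left, where some_plot stands in for any ggplot object with a legend.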
There are a few more things we can do to the plot for now that include updating the background panel to get rid of the grey colour and maybe darkening our axis tick lines and axis lines themselves.
panel.background parameter which expects an element_rect() to define its properties.
panel.grid.* gives us access to the background axes lines using element_line()
axis.* elements to update their format a bit too.
plot a little bit by setting the overall background colour.
# This is going to be a simpler graph so adjust our plot window size accordingly
options(repr.plot.width=20, repr.plot.height=10)
# Build our plot and data from scratch
covid_phu_window.df %>%
# Filter for the top 4 infected PHUs
filter(public_health_unit %in% phu_by_total_cases_desc[1:4]) %>%
# redirect the filtered result to ggplot
# 1. Data
ggplot(.) +
# 2. Aesthetics
aes(x = start_date, y = window_mean, # Set our x and y axes
colour = fct_reorder(public_health_unit, window_mean, .desc=TRUE)) + # Reorder our PHUs
# Theme elements
theme(text = element_text(size=20), # set text size to 20
# Move the legend around to within the panel space
legend.justification = c(0,1),
legend.position = c(0.02,0.95),
legend.direction = "horizontal",
### 1.1.2 Update the panel colour and line colours
panel.background = element_rect(fill = "white"),
panel.grid.major = element_line(colour = "grey"),
panel.grid.minor = element_line(colour = "black"),
### 1.1.2 Use a black line for the axes
axis.line = element_line(colour = "black"),
axis.text = element_text(colour = "black", face="bold"),
### 1.1.2 Update the plot background colour
plot.background = element_rect(fill = "lightblue")
) +
# 4. Geoms
geom_line(linewidth=1)
ggplot2
In our above example we made alterations to the theme that affected background colour and axis lines. While some of you may lean towards the more artistic side, you can also use premade themes from both the ggplot2 package and additional packages like ggthemes. Below you'll find a list of the themes from ggplot2.
| Theme | Description |
|---|---|
| theme_gray() | Grey background colour, white grid lines. |
| theme_bw() | White background colour, grey grid lines. |
| theme_linedraw() | White background colour, black lines of various widths |
| theme_light() | White background colour, grey lines of various widths |
| theme_dark() | Dark background colour, grey lines of various widths |
| theme_minimal() | No background annotations, grey lines |
| theme_classic() | White background, x/y axis lines, no grid lines |
| theme_void() | A completely empty theme, white background, no axis or grid lines |
If you find a theme that you mostly like, you can use that as a base to your graph before making additional theme() alterations. Let's try a few of these out.
# This is going to be a simpler graph so adjust our plot window size accordingly
options(repr.plot.width=20, repr.plot.height=10)
# Build our plot and save to an object
phu_window.plot <- covid_phu_window.df %>%
# Filter for the top 4 infected PHUs
filter(public_health_unit %in% phu_by_total_cases_desc[1:4]) %>%
# redirect the filtered result to ggplot
# 1. Data
ggplot(.) +
# 2. Aesthetics
aes(x = start_date, y = window_mean, # Set our x and y axes
colour = fct_reorder(public_health_unit, window_mean, .desc=TRUE)) + # Reorder our PHUs
# Theme elements
### 1.2.0 Start with a base theme
theme_minimal() +
theme(text = element_text(size=20), # set text size to 20
# Move the legend around to within the panel space
legend.justification = c(0,1),
legend.position = c(0.02,0.95),
legend.direction = "horizontal",
# Update the panel to drop the minor y-axis grid lines
panel.grid.minor = element_blank(),
# Use a black line for the axes
axis.line = element_line(colour="black"),
axis.text = element_text(colour="black", face="bold"),
) +
# 4. Geoms
geom_line(linewidth=1)
# plot our object to standard output
phu_window.plot
# Try to add theme_dark() to our plot. What are the consequences?
phu_window.plot + theme_dark()
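One consequence worth checking: premade themes such as theme_dark() are complete themes, so adding one after a theme() call replaces the earlier adjustments rather than merging with them. The sketch below shows this with theme objects alone, so no lecture data is needed; the variable names are illustrative.

```r
library(ggplot2)

# Our own adjustment: move the legend to the bottom
legend_tweak <- theme(legend.position = "bottom")

# Premade theme first, tweak second: the tweak is kept
base_then_tweak <- theme_dark() + legend_tweak
base_then_tweak$legend.position

# Tweak first, premade theme second: the complete theme replaces the tweak
tweak_then_base <- legend_tweak + theme_dark()
tweak_then_base$legend.position
```

The same ordering applies when the pieces are added to a plot, which is why the in-panel legend and axis styling stored in phu_window.plot fall back to the defaults of theme_dark(). Placing your theme() call after the premade theme keeps your changes.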
ggthemes mimics visual styles from multiple sources
If you are feeling a little more daring with your choices, you can turn to the ggthemes package to mimic styles from a number of publications such as The Economist and The Wall Street Journal. You can look up a list of the various themes at https://github.com/jrnold/ggthemes.
Like the themes provided by ggplot, you can also make edits to these themes within your scripts.
Two additional package options with different colour palettes and shapes are ggthemr and ggsci.
# This is going to be a simpler graph so adjust our plot window size accordingly
options(repr.plot.width=20, repr.plot.height=10)
# Build our plot
covid_phu_window.df %>%
# Filter for the top 4 infected PHUs
filter(public_health_unit %in% phu_by_total_cases_desc[1:4]) %>%
# redirect the filtered result to ggplot
# 1. Data
ggplot(.) +
# 2. Aesthetics
aes(x = start_date, y = window_mean, # Set our x and y axes
colour = fct_reorder(public_health_unit, window_mean, .desc=TRUE)) + # Reorder our PHUs
# Theme elements
### 1.3.0 Switch to the stata theme
theme_stata() +
theme(text = element_text(size=20), # set text size to 20
# Move the legend around to within the panel space
legend.justification = c(0,1),
legend.position = c(0.02,0.95),
legend.direction = "horizontal",
# Use a black line for the axes
axis.line = element_line(colour = "black"),
axis.text = element_text(colour = "black", face="bold"),
) +
# 4. Geoms
geom_line(linewidth=1)
Now that we have played around with how to reposition legends, and other elements of your plot, we can discuss how to change the actual text content of your plot. Many times we want to relabel axes or legends, even legend labels. There are a number of layers we can work through but we'll present some of the simplest ways to accomplish this.
labs() command
Up to this point, we've seen the use of different commands to alter the labels and titles like:
xlab(): Update the x-axis label.
ylab(): Update the y-axis label.
ggtitle(): Update the plot title.
You can also access multiple options within a single call to the labs() layer which accepts the following parameters:
...: a list of name-value pairs that map back to an aesthetic (e.g. x = "X-axis" or colour = "Population"). Use a NULL value to remove a title for a specific label.
title, subtitle: the title with a subtitle displayed below
caption: the text for the caption is displayed in the bottom-right by default
tag: figure text tag/label usually for figure panels in manuscripts
Let's relabel our plot axis and titles to be more accurate. For now we'll drop the Stata theme and go with our own alteration of theme_minimal(). We'll also include a caption in the bottom right to explain how we display the 14-day rolling mean. You'll also notice that the extremely long legend title will be quite easily fixed!
Note: a quick way of adding space to your titles is to include the \n character, which inserts a line break.
# This is going to be a simpler graph so adjust our plot window size accordingly
options(repr.plot.width=20, repr.plot.height=10)
# Build our plot
covid_phu_window.df %>%
# Filter for the top 4 infected PHUs
filter(public_health_unit %in% phu_by_total_cases_desc[1:4]) %>%
# redirect the filtered result to ggplot
# 1. Data
ggplot(.) +
# 2. Aesthetics
aes(x = start_date, y = window_mean, # Set our x and y axes
colour = fct_reorder(public_health_unit, window_mean, .desc=TRUE)) + # Reorder our PHUs
# Theme elements
# Start with a base theme
theme_minimal() +
theme(text = element_text(size=20), # set text size to 20
# Move the legend around to within the panel space
legend.justification = c(0,1),
legend.position = c(0.02,0.95),
legend.direction = "horizontal",
# Update the panel to drop the minor axis grid lines
panel.grid.minor = element_blank(),
# Use a black line for the axes
axis.line = element_line(colour = "black"),
axis.text = element_text(colour = "black", face="bold"),
) +
### 2.1.0 Add labels to our plot
labs(title = "Mean cases of COVID-19 in a 14-day window across top 4 Ontario Public Health Units\n",
x = "\nWindow date",
y = "Mean cases in 14-day window\n",
colour = "Public Health Unit",
caption = "*14-day rolling mean with date as start of the window") +
# 4. Geoms
geom_line(linewidth=1)
labels parameter
In last lecture's assignment, you likely used the xlim() or ylim() layers to set the axis limits on some of your visualizations. As with all things, there is more than one pathway to our goals.
The scale_*() functions can also be used to set the title, limits, breaks, and labels along your axes. Some of these parameters are redundant and can override other ggplot2 layer commands, depending on the order you have included them.
| Parameter | Equivalent ggplot layer command |
|---|---|
| name | xlab(), ylab(), labs(x = ...), labs(y = ...) |
| limits | xlim(), ylim() |
| breaks | Determine where axis tick marks are generated |
| labels | Rename the labels present at axis tick marks |
scale_*_date()
We'll start with a familiar idea we've been working with since lecture 1. A good portion of our pandemic visualizations have focused on looking at data over time. With the scale_x_date() layer, we have set limits, breaks and label formats. Unlike more discrete data sets that we'll see later, the scale_*_date() layer has some very specific parameters that surround the idea of dates and how they are formatted. Last week we took a close look at scale_x_date() in section 3.3.2 of the lecture:
breaks: while you can set specific breaks for dates with this parameter, you will need a specific vector of date values that matches your own data groups.
date_breaks: a convenient string representation to describe the distance between breaks like "12 days", or "3 years". This parameter will override any information passed to breaks.
date_labels: a convenient string representation to describe the format of dates defined by strftime(). Information found here
Let's start by relabeling our x-axis to show us our dates by month and at the same time we set a limit to show us data starting in December of 2020. We've done this before so it should be easy.
# This is going to be a simpler graph so adjust our plot window size accordingly
options(repr.plot.width=20, repr.plot.height=10)
# Build our plot
covid_phu_window.df %>%
# Filter for the top 4 infected PHUs
filter(public_health_unit %in% phu_by_total_cases_desc[1:4]) %>%
# redirect the filtered result to ggplot
# 1. Data
ggplot(.) +
# 2. Aesthetics
aes(x = start_date, y = window_mean, # Set our x and y axes
colour = fct_reorder(public_health_unit, window_mean, .desc=TRUE)) + # Reorder our PHUs
# Theme elements
# Start with a base theme
theme_minimal() +
theme(text = element_text(size=20), # set text size to 20
# Move the legend around to within the panel space
legend.justification = c(0,1),
legend.position = c(0.02,0.95),
legend.direction = "horizontal",
# Update the panel to drop the minor axis grid lines
panel.grid.minor = element_blank(),
# Use a black line for the axes
axis.line = element_line(colour = "black"),
axis.text = element_text(colour = "black", face="bold"),
) +
# Add labels to our plot
labs(title = "Mean cases of COVID-19 in a 14-day window across top 4 Ontario Public Health Units\n",
x = "\nWindow date",
y = "Mean cases in 14-day window\n",
colour = "Public Health Unit",
caption = "*14-day rolling mean with date as start of the window") +
# 3. Scaling
### 2.2.1 Start looking at data from December 2020 onwards
scale_x_date(limits = c(as.Date("2020-12-01"), # Set a start date for our limit
as.Date(max(covid_phu_window.df$start_date))), # Identify the last date and use that
date_breaks = "1 month", # How will we break up the dates?
date_labels = "%b-%Y") + # How will we format labels
# 4. Geoms
geom_line(linewidth=1)
Warning message: "Removed 1252 rows containing missing values (`geom_line()`)."
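To preview what an interval string and a format string will produce before plotting, the same ideas can be tried in base R: seq() on dates accepts an interval such as "1 month", and format() uses the same strftime() codes as date_labels. This is only a small sketch with made-up dates; it approximates, rather than reproduces, how scale_x_date() chooses its breaks.

```r
# Monthly dates starting December 2020, similar in spirit to date_breaks = "1 month"
month_starts <- seq(as.Date("2020-12-01"), by = "1 month", length.out = 4)
month_starts

# strftime() codes as used by date_labels
format(month_starts, "%b-%Y")     # abbreviated month name (locale-dependent) and 4-digit year
format(month_starts, "%Y-%m")     # numeric year and month
format(month_starts, "%d/%m/%y")  # day/month/2-digit year
```

Trying a format string this way is quicker than re-rendering a full plot each time the label style changes.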
element_text() function
At this point you'll notice that our x-axis text is also pretty messy. Let's revisit the axis.text.x component of theme to deal with this. There are a few things we can influence with element_text() including:
angle: use this to rotate text from a horizontal position, in a counter-clockwise direction.
vjust and hjust: the vertical and horizontal justification of your text as a value from 0 to 1, where 0.5 is "centered".
family: determine the font used
face: determine the font face (plain, bold, italic, bold.italic)
size, lineheight, color, colour: alter other characteristics of your text display
debug: a handy tool that draws a border around your complete text area and a point where each label is anchored. Great for helping to get that "perfect" look on your figures.
Let's fix up our current visualization by rotating our text and right-justifying it.
# This is going to be a simpler graph so adjust our plot window size accordingly
options(repr.plot.width=20, repr.plot.height=10)
# Build our plot
covid_phu_window.df %>%
# Filter for the top 4 infected PHUs
filter(public_health_unit %in% phu_by_total_cases_desc[1:4]) %>%
# redirect the filtered result to ggplot
# 1. Data
ggplot(.) +
# 2. Aesthetics
aes(x = start_date, y = window_mean, # Set our x and y axes
colour = fct_reorder(public_health_unit, window_mean, .desc=TRUE)) + # Reorder our PHUs
# Theme elements
# Start with a base theme
theme_minimal() +
theme(text = element_text(size=20), # set text size to 20
# Move the legend around to within the panel space
legend.justification = c(0,1),
legend.position = c(0.02,0.95),
legend.direction = "horizontal",
# Update the panel to drop the minor axis grid lines
panel.grid.minor = element_blank(),
# Use a black line for the axes
axis.line = element_line(colour = "black"),
axis.text = element_text(colour = "black", face="bold"),
### 2.2.1.1 Adjust the x-axis text
axis.text.x = element_text(angle = 90, # Rotate 90
hjust = 1, # Right-justify
vjust = 0.5) # Centre text "vertically" on axis tick
) +
# Add labels to our plot
labs(title = "Mean cases of COVID-19 in a 14-day window across top 4 Ontario Public Health Units\n",
x = "\nWindow date",
y = "Mean cases in 14-day window\n",
colour = "Public Health Unit",
caption = "*14-day rolling mean with date as start of the window") +
# 3. Scaling
# Start looking at data from December 2020 onwards
scale_x_date(limits = c(as.Date("2020-12-01"), # Set a start date for our limit
as.Date(max(covid_phu_window.df$start_date))), # Identify the last date and use that
date_breaks = "1 month", # How will we break up the dates?
date_labels = "%b-%Y") + # How will we format labels
# 4. Geoms
geom_line(linewidth=1)
Warning message: "Removed 1252 rows containing missing values (`geom_line()`)."
limits and breaks
Much of your quantitative data will usually come as a continuous series of values. We've played around with these scales before using scale_*_log10() in lecture and assignment. Similarly, we can alter continuous axes without necessarily transforming them. This is accomplished via the scale_*_continuous() layer. With these types of layers, we have access to parameters like:
breaks, minor_breaks: a numeric vector of positions OR a function that takes the limits as input and returns breaks as output for the parameter specified.
n.breaks: an integer to suggest the number of major breaks. The plotting algorithm may alter this value to ensure nice break labels. This will only work if breaks = waiver() (the default for breaks).
labels: a character vector matching labels to the major breaks.
limits: a numeric vector c(lower, upper)
Let's break our y-axis into major tick-marks of every 500 cases by altering scale_y_continuous() with the seq() function. At the same time, let's remove the title from our legend by setting the colour label in labs() to a NULL value.
# This is going to be a simpler graph so adjust our plot window size accordingly
options(repr.plot.width=20, repr.plot.height=10)
# Build our plot and save to an object for later use
phu_window.plot <- covid_phu_window.df %>%
# Reorder the PHU factor here
mutate(public_health_unit = fct_reorder(public_health_unit, window_mean, .desc=TRUE)) %>%
# Filter for the top 4 infected PHUs
filter(public_health_unit %in% phu_by_total_cases_desc[1:4]) %>%
# redirect the filtered result to ggplot
# 1. Data
ggplot(.) +
# 2. Aesthetics
aes(x = start_date, y = window_mean, # Set our x and y axes
colour = public_health_unit) +
# Start with a base theme
theme_minimal() +
theme(text = element_text(size=20), # set text size to 20
# Move the legend around to within the panel space
legend.justification = c(0,1),
legend.position = c(0.02,0.95),
legend.direction = "horizontal",
# Update the panel to drop the minor axis grid lines
panel.grid.minor = element_blank(),
# Use a black line for the axes
axis.line = element_line(colour = "black"),
axis.text = element_text(colour = "black", face="bold"),
# Adjust the x-axis text
axis.text.x = element_text(angle = 90, # Rotate 90
hjust = 1, # Right-justify
vjust = 0.5) # Centre text "vertically" on axis tick
) +
# Add labels to our plot
labs(title = "Mean cases of COVID-19 in a 14-day window across top 4 Ontario Public Health Units\n",
x = "\nWindow date",
y = "Mean cases in 14-day window\n",
colour = NULL,
caption = "*14-day rolling mean with date as start of the window") +
# 3. Scaling
# Start looking at data from December 2020 onwards
scale_x_date(limits = c(as.Date("2020-12-01"), # Set a start date for our limit
as.Date(max(covid_phu_window.df$start_date))), # Identify the last date and use that
date_breaks = "1 month", # How will we break up the dates?
date_labels = "%b-%Y") + # How will we format labels
### 2.2.2 Change our y-axis breaks
scale_y_continuous(limits = c(0, 3500), breaks = seq(0, 3500, 500)) +
# 4. Geoms
geom_line(linewidth=1)
# plot our object to standard output
phu_window.plot
Warning message: "Removed 1252 rows containing missing values (`geom_line()`)."
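The breaks parameter also accepts a function of the axis limits, as noted in the parameter list above, which avoids hard-coding the upper value. The following is a hedged sketch: every_500() is a hypothetical helper and toy.df is made-up data standing in for the PHU window data.

```r
library(ggplot2)

# Fixed breaks, as used in the plot above
seq(0, 3500, 500)

# Hypothetical helper: breaks every 500 units, up to the first multiple of 500
# at or above the upper limit that ggplot2 passes in
every_500 <- function(limits) {
  seq(0, ceiling(limits[2] / 500) * 500, by = 500)
}
every_500(c(0, 3210))

# Made-up data for illustration only
toy.df <- data.frame(day = 1:8,
                     window_mean = c(120, 480, 950, 1700, 2600, 3210, 2400, 1300))

toy.plot <- ggplot(toy.df) +
  aes(x = day, y = window_mean) +
  # Start the axis at 0, let the data set the upper limit, compute breaks from it
  scale_y_continuous(limits = c(0, NA), breaks = every_500) +
  geom_line(linewidth = 1)
toy.plot
```

A function keeps the tick spacing consistent if the data are later updated and the maximum changes, whereas a fixed seq() call would need to be edited by hand.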
scale_*_discrete()
For various reasons, you may have categorical or grouped data with unusual names. It may be convenient to code your data this way but letting ggplot2 assign these to your axes or labels may not be suitable. Instead, you can manually rename them using the labels parameter with your various scale_*_discrete() layers.
When manually labeling your categories be sure to supply a vector with the correct number of arguments to match the number of levels in your categories or groups.
Let's revisit some of our age-grouped data from last week, which was visualized as a grouped violin plot with inset boxplots. Recall that our data was labelled by the variable age_group using "0 to 4", "5 to 11", etc. We'll modify those in the plot (rather than the data frame) to a format that looks like "0-4", "5-11", etc.
# Let's briefly review the dataset
str(covid_demographics_total.df, give.attr = FALSE)
gropd_df [476 x 15] (S3: grouped_df/tbl_df/tbl/data.frame)
 $ period                      : chr [1:476] "recent" "recent" "recent" "recent" ...
 $ from_date                   : chr [1:476] "18-Feb-23" "18-Feb-23" "18-Feb-23" "18-Feb-23" ...
 $ to_date                     : chr [1:476] "04-Mar-23" "04-Mar-23" "04-Mar-23" "04-Mar-23" ...
 $ public_health_unit          : chr [1:476] "Algoma" "Algoma" "Algoma" "Algoma" ...
 $ age_group                   : Factor w/ 7 levels "0 to 4","5 to 11",..: 1 3 4 5 2 6 7 1 3 4 ...
 $ total_cases                 : num [1:476] 4 2 15 24 1 19 8 2 1 4 ...
 $ total_rate                  : num [1:476] 83.6 22.2 57.6 84.4 12.5 ...
 $ total_hospitalizations_count: num [1:476] 0 0 0 2 0 8 2 0 0 0 ...
 $ total_hospitalizations_rate : num [1:476] 0 0 0 7 0 24.3 25.1 0 0 0 ...
 $ male_cases                  : num [1:476] 2 0 4 9 0 11 4 1 0 0 ...
 $ female_cases                : num [1:476] 2 2 11 15 1 8 4 1 1 4 ...
 $ percent_cases               : num [1:476] 0.0548 0.0274 0.2055 0.3288 0.0137 ...
 $ percent_hospitalizations    : num [1:476] 0 0 0 0.167 0 ...
 $ percent_male_cases          : num [1:476] 0.5 0 0.267 0.375 0 ...
 $ percent_female_cases        : num [1:476] 0.5 1 0.733 0.625 1 ...
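Typing each label by hand works, but the labels parameter of the scale_*_discrete() layers also accepts a function that receives the existing labels, which sidesteps the risk of supplying the wrong number of labels. The sketch below assumes the age_group levels follow the "x to y" pattern seen in the str() output; shorten_age_labels() is a hypothetical helper, and the level vector and values are retyped or made up for illustration.

```r
library(ggplot2)

# Age-group levels retyped for illustration (assumed to match the data)
age_levels <- c("0 to 4", "5 to 11", "12 to 19", "20 to 39",
                "40 to 59", "60 to 79", "80+")

# Hypothetical helper: turn "0 to 4" into "0-4"; labels without " to " pass through
shorten_age_labels <- function(x) gsub(" to ", "-", x, fixed = TRUE)
shorten_age_labels(age_levels)

# Made-up values, one per age group, just to have something to draw
toy.df <- data.frame(age_group = factor(age_levels, levels = age_levels),
                     percent   = c(0.03, 0.05, 0.08, 0.30, 0.28, 0.18, 0.08))

toy.plot <- ggplot(toy.df) +
  aes(x = age_group, y = percent) +
  # The function is applied to the existing axis labels at draw time
  scale_x_discrete(labels = shorten_age_labels) +
  geom_col()
toy.plot
```

Because the function works from whatever levels are present, the relabelling stays in step with the data if a category is added or dropped, while the data frame itself is left untouched.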
# Adjust our plot window size according to the expected output
options(repr.plot.width=20, repr.plot.height=12)
# Build and save the plot for later use
demographics.plot <- covid_demographics_total.df %>%
# Ungroup this dataframe to clean it up a little
ungroup() %>%
# Filter for cumulative data
filter(period == "cumulative") %>%
# Select for just the important columns
select(public_health_unit, age_group, percent_cases, percent_hospitalizations) %>%
# Pivot the modified table to capture the "stat_group" of percent_cases vs percent_hospitalizations
pivot_longer(cols=c(3,4), names_to = "stat_group", values_to = "percent_PHU_total") %>%
# Plot the data as a grouped violin plot with inset boxplot
# 1. Data
ggplot(.) +
# 2. Aesthetics
aes(x=age_group, y = percent_PHU_total) +
# Start with a base theme
theme_minimal() +
theme(text = element_text(size=20), # set text size to 20
# Move the legend around to within the panel space
legend.justification = c(0,1),
legend.position = c(0.02,0.95),
legend.direction = "horizontal",
# Update the panel to drop the minor axis grid lines
panel.grid.minor = element_blank(),
# Use a black line for the axes
axis.line = element_line(colour = "black"),
axis.text = element_text(colour = "black", face="bold"),
) +
# Add labels to the plot
labs(title = "Percent cases and hospitalizations by proportion per PHU across age group - cumulative across pandemic",
x = "\nAge group",
y = "Proportion of reported PHU data\n",
caption = "\n*Age group values are calculated as a percentage of total cases or hospitalizations within a PHU") +
# 3. Scaling
scale_y_continuous(limits = c(0, 0.5)) + # Set the limits of our y-axis
scale_colour_manual(values=c("black", "black"))+ # we'll need this to fix our boxplot outlines
### 2.2.3 Set the labels of our x-axis categories
scale_x_discrete(labels=c("0-4", "5-11", "12-19", "20-39", "40-59", "60-79", "80+"))+
# 4. Data
# multi-factor violin plots but keep the width consistent
geom_violin(scale="width", aes(fill=stat_group)) +
# Boxplot but smaller width so they reside "within" the violin plot
geom_boxplot(aes(colour = stat_group), width=0.2,
position = position_dodge(width=0.9),
outlier.shape=NA) + # Remove the outliers
# Add in all of the data points
geom_quasirandom(dodge.width = 0.85, aes(group=stat_group), alpha = 0.8)
# Show the plot
demographics.plot
guide parameter or guides() layer
Nearly there with updating this plot! We've relabeled the x-axis categories but our legend isn't quite there. Previously we used the labs() layer to handle the legend title, but this time around we also want to alter the labels of our data categories to "% cases" and "% hospitalizations". Before we get into that, let's talk a little more about legends.
Normally you can let ggplot2 take the wheel and automatically generate guides for you. Whenever you set colour/fill/linetype etc in your aesthetics, this will generate a legend. When the groups are mapped in the same way (i.e. the same labels!) between different aesthetics, the legends may be combined.
There will be instances, however, when you need to adjust your legend or get rid of it altogether. Adjustments can range from changing titles to combining guides across different aesthetics. There are a number of ways to achieve the same result when working with guides and we'll go through several examples. First, however, we should discuss the types of legends:
| guide | short call | Description |
|---|---|---|
| guide_legend() | legend | The base prototype of the legend which integrates how geoms are mapped into values. |
| guide_bins() | bins | A binned version of legends which places ticks between keys and has its own small axis |
| guide_colourbar() | colourbar | For mapping continuous colour/fill scales, such as those from scale_fill_*() and scale_colour_*(). |
| guide_coloursteps() | coloursteps | A version of guide_colourbar() except for binned colour and fill scales rather than gradients. |
| none | NA | Suppress the legend as specified |
We briefly saw the use of a colourbar in our last lecture when using a continuous variable to set the colour of our barplots. Each type has its own use depending on how you want to describe your data. Within each of the guide types, you can update parameters about text within the legend.
| Component | Sub-components |
|---|---|
| title | name, position, theme, hjust, vjust |
| label | name, position, theme, hjust, vjust |
| key | width, height |
| order | you can determine the order of the guide amongst others using integers [1:99]. 0 sets order by an algorithm |
| other | direction of guide, number of rows/cols |
So where can you use these methods?
scale_*() to set guide parameters
Within each scale_*() layer you declare, you can set the guide parameter to one of the above guide types. To exclude a legend for that particular scale, set the value to "none".
Some layer options you may work with here are scale_fill_discrete() and scale_shape_manual() and scale_colour_continuous() - some of which we've seen in previous lectures. Notice that fill, shape and colour are all aesthetic parameters we can change in our data mapping.
Let's update our fill guide to change the legend title to "Data category" and relabel our categories to "% cases" and "% hospitalizations" as previously discussed.
# Adjust the fill scale layer for the demographics plot
demographics.plot +
### 2.3.1 Set the fill guide details
scale_fill_discrete(name = "Data category", # Guide name
labels = c("% cases", "% hospitalizations")) # Relabel the categories
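To see the guide parameter on its own, here is a minimal sketch on made-up data where one variable is mapped to two aesthetics and only one legend is kept. The names `toy.df` and `toy.plot`, and the labels "group A" and "group B", are hypothetical.

```r
library(ggplot2)

# Hypothetical toy data with one grouping variable mapped to two aesthetics
toy.df <- data.frame(
  x = rep(1:3, times = 2),
  y = c(1, 2, 3, 2, 3, 5),
  grp = rep(c("a", "b"), each = 3)
)

toy.plot <- ggplot(toy.df) +
  aes(x = x, y = y, colour = grp, shape = grp) +
  geom_point(size = 4) +
  # Keep a legend for colour, with a new title and new category labels
  scale_colour_discrete(name = "Data category",
                        labels = c("group A", "group B"),
                        guide = "legend") +
  # Suppress the legend that the shape aesthetic would otherwise generate
  scale_shape_discrete(guide = "none")

toy.plot
```

Because the shape guide is suppressed, only the relabelled colour legend should remain on the plot.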
guides() layer to manipulate multiple guides
While our output is nearly correct, there is still a problem! We now have two sets of legends! If you look carefully at the ggplot code, you'll see that we set aesthetics in three places:
- geom_violin(scale="width", aes(fill=stat_group))
- geom_boxplot(aes(colour = stat_group)...
- geom_quasirandom(dodge.width = 0.85, aes(group=stat_group), alpha = 0.8)
Across 3 geoms we've generated 3 aesthetic groups: fill, colour and group. Remember when we said that ggplot would take the wheel and generate legend/guide information automatically? Well, this is a case where all three are mapped to the same variable so they get combined into a single legend. When we took the time to change the fill guide, however, it was broken away from the other two guides.
In a case like this we use the guides() layer to set multiple guides at once, using the aesthetic names (e.g. colour, size, shape) as parameters. Much like labs(), it gives us centralized access to guide format and settings, allowing us to quickly rectify our problem. In this case, we really don't need the group or colour guides, so we'll simply get rid of them.
# Adjust the fill scale layer for the demographics plot
demographics.plot +
### 2.3.2 Use the guides() layer and get rid of the scale_fill_discrete() layer
guides(fill = guide_legend(title = "Data category"),
colour = "none", group = "none") +
# Set the fill guide details
scale_fill_discrete(labels = c("% cases", "% hospitalizations")) # Relabel the categories
![]() |
|---|
| Well you could rely on the basic colour palette but you're better off picking your own colours! |
Up to this point, we've danced around the idea of colour in our lectures and assignments. For those of you that aren't familiar with your colour choices, here is a quick breakdown of colour palettes.
A common thing to want to do is to change colours from ggplot2's default rainbow palette. There are many reasons to change a colour palette including
When we talk about colour palettes and their purpose, there are 3 main types.
Sequential - implies an order to your data - i.e. light to dark implies low values to high values. These are helpful when working with continuous data scales of increasing value e.g. heatmaps.
# Load the RColorBrewer library
library(RColorBrewer)
# display the sequential colour palettes
display.brewer.all(type = "seq")
Diverging - low and high values are extremes, and the middle values are important. These palettes go from light in the middle to dark at both ends, with 3 colours mainly used.
# Display the diverging colour palettes
display.brewer.all(type = "div")
Qualitative - there is no quantitative relationship between colours. This is usually used for categorical data when you want each category to be visualized distinctly.
display.brewer.all(type = "qual")
Let's test one of the RColorBrewer palettes out on our data. We'll add it as a layer to phu_window.plot using scale_colour_brewer() to override the colour mappings defined in the aes() layer of the plot. Some parameters we can keep in mind:
- type: determines the kind of palette as sequential (seq), diverging (div) or qualitative (qual)
- palette: accepts a string name for a palette or an integer that combines with type to pick a palette
Note that colour palettes are not vector recycled when plotting in ggplot. This means if you don't supply enough colours to match your groups, then unassigned groups will simply be cut off or not displayed.
More information on palette order and other parameters can be found here
phu_window.plot +
# Use the Dark2 palette
scale_colour_brewer(palette=...)
phu_window.plot +
# Pick a qualitative colour palette
scale_colour_brewer(type=..., palette=...)
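Since palettes are not recycled, it can help to check how many colours a palette offers before using it. The sketch below does that with RColorBrewer's `brewer.pal.info` table and then applies a palette to made-up data; `toy.df`, `toy.plot`, `max_colours` and `set2_colours` are hypothetical names, and "Set2" is just one arbitrary qualitative palette.

```r
library(ggplot2)
library(RColorBrewer)

# How many colours does the palette offer? Groups beyond this get no colour.
max_colours <- brewer.pal.info["Set2", "maxcolors"]

# Extract the first three hex codes of the qualitative "Set2" palette
set2_colours <- brewer.pal(n = 3, name = "Set2")

# Hypothetical toy data with three categories
toy.df <- data.frame(
  x = rep(1:4, times = 3),
  y = c(1:4, (1:4) * 2, (1:4) * 3),
  grp = rep(c("a", "b", "c"), each = 4)
)

toy.plot <- ggplot(toy.df) +
  aes(x = x, y = y, colour = grp) +
  geom_line(linewidth = 1) +
  # Select the palette by name ...
  scale_colour_brewer(palette = "Set2")
  # ... or by type and index, e.g. scale_colour_brewer(type = "qual", palette = 2)

toy.plot
```

The extracted hex codes in `set2_colours` could equally be passed to a manual scale if you only want part of a palette.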
You can always choose a vector of your own colors using this 'R color cheatsheet' (https://www.nceas.ucsb.edu/~frazier/RSpatialGuides/colorPaletteCheatsheet.pdf).
Names of colours as well as hex colour codes are accepted. You can supply a manual list using the scale_*_manual() command.
phu_window.plot +
# Set your own manual colour choices
scale_colour_manual(values=c(..., "cornflowerblue", "orange", ...))
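When supplying your own colours, a named vector makes the assignment explicit, because the values are then matched to groups by name rather than by position. This is a minimal sketch on made-up data; `toy.df`, `my_colours` and `toy.plot` are hypothetical names and the colour choices are arbitrary.

```r
library(ggplot2)

# Hypothetical toy data with three categories
toy.df <- data.frame(
  x = rep(1:4, times = 3),
  y = c(1:4, (1:4) * 2, (1:4) * 3),
  grp = rep(c("a", "b", "c"), each = 4)
)

# Colour names and hex codes can be mixed; the names match the groups in grp
my_colours <- c(a = "cornflowerblue", b = "orange", c = "#1B9E77")

# Sanity check: every group has a colour, since palettes are not recycled
stopifnot(all(unique(toy.df$grp) %in% names(my_colours)))

toy.plot <- ggplot(toy.df) +
  aes(x = x, y = y, colour = grp) +
  geom_line(linewidth = 1) +
  # Set the manual colour choices
  scale_colour_manual(values = my_colours)

toy.plot
```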
viridis package
The viridis package also has some nice colour palettes (https://cran.r-project.org/web/packages/viridis/vignettes/intro-to-viridis.html). These are perceptually uniform sequential palettes meant to help highlight true colour change across continuous scales. You've seen them come up a few times in our plots; they do well for small categorical sets but the colours begin to blend as the number of categories increases.
The main calls we can use follow the format scale_*_viridis_c/d/b() where the "c/d/b" represents continuous/discrete/binned data and the types of additional arguments that can be passed on to augment the call. There are some additional parameters that can be used to set the colours when called:
- option: accepts a character string selecting one of the available colour scales, given either by name or by letter code, including "magma"/"A", "inferno"/"B", "plasma"/"C", "viridis"/"D" or "cividis"/"E".
- direction: sets the direction of the palette order. Use -1 to reverse it.
phu_window.plot +
# Use a colour-blind friendly palette
scale_colour_viridis_d(...)
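As a minimal sketch of how option and direction combine, here is a discrete viridis scale applied to made-up data. `toy.df` and `toy.plot` are hypothetical names, and the choices of "plasma", a reversed direction and end = 0.9 (which trims the palest colours) are arbitrary.

```r
library(ggplot2)

# Hypothetical toy data with three categories
toy.df <- data.frame(
  x = rep(1:4, times = 3),
  y = c(1:4, (1:4) * 2, (1:4) * 3),
  grp = rep(c("a", "b", "c"), each = 4)
)

toy.plot <- ggplot(toy.df) +
  aes(x = x, y = y, colour = grp) +
  geom_line(linewidth = 1) +
  # Discrete viridis scale: "plasma" palette, reversed order,
  # stopping at 90% of the palette range
  scale_colour_viridis_d(option = "plasma", direction = -1, end = 0.9)

toy.plot
```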
after_scale to set an aesthetic mapping dependent upon another one
There may be times when you want to link certain aesthetics to each other, like colour and fill for instance. Perhaps you want to set both to a custom value but one as a lighter shade. Rather than setting both mappings to a data variable and then using a scale layer to set the values, you can set one mapping as dependent upon another. There are transformations and mappings of data to aesthetics happening under the hood at 3 stages when evaluating a ggplot object:
1. At the start, when the layer data you supply is first mapped to aesthetics.
2. After the data has been transformed by the layer's stat. Use after_stat() to access this data.
3. After the data has been mapped by the plot scales (for example by scale_colour_manual()). From there, you can dictate how another aesthetic mapping will determine its values.
Using the after_scale() function will postpone an aesthetic mapping until after the data has been scaled. As we'll see next, when used properly, you will tie the aesthetics of one aspect to the aesthetics of another. There are a number of cool ways you can utilize after_stat() as well to add finishing touches like counts/values to your graphs.
Going back to our previous violin plot, we'll utilize after_scale() to link the fill values of our violin plot to the colour set of the same violin plot. At the same time we'll de-couple those aesthetics from the ones we use in the inset boxplot of the visualization. Enough talk though, let's see what that looks like.
# Adjust our plot window size according to the expected output
options(repr.plot.width=20, repr.plot.height=12)
# Build and save the plot for later use
demographics.plot <- covid_demographics_total.df %>%
# Ungroup this dataframe to clean it up a little
ungroup() %>%
# Filter for cumulative data
filter(period == "cumulative") %>%
# Select for just the important columns
select(public_health_unit, age_group, percent_cases, percent_hospitalizations) %>%
# Pivot the modified table to capture the "stat_group" of percent_cases vs percent_hospitalizations
pivot_longer(cols=c(3,4), names_to = "stat_group", values_to = "percent_PHU_total") %>%
# Plot the data as a grouped violin plot with inset boxplot
# 1. Data
ggplot(.) +
# 2. Aesthetics
aes(x=age_group, y = percent_PHU_total) +
# Start with a base theme
theme_minimal() +
theme(text = element_text(size=20), # set text size to 20
# Move the legend around to within the panel space
legend.justification = c(0,1),
legend.position = c(0.02,0.95),
legend.direction = "horizontal",
# Update the panel to drop the minor axis grid lines
panel.grid.minor = element_blank(),
# Use a black line for the axes
axis.line = element_line(colour = "black"),
axis.text = element_text(colour = "black", face="bold"),
) +
# Add labels to the plot
labs(title = "Percent cases and hospitalizations by proportion per PHU across age group",
x = "\nAge group",
y = "Proportion of reported PHU data\n",
caption = "\n*Age group values are calculated as a percentage of total cases or hospitalizations within a PHU") +
### Use the guides() layer and get rid of the scale_fill_discrete() layer
guides(fill = "none", group = "none") +
# 3. Scaling
scale_y_continuous(limits = c(0, 0.5)) + # Set the limits of our y-axis
# Set the labels of our x-axis categories
scale_x_discrete(labels=c("0-4", "5-11", "12-19", "20-39", "40-59", "60-79", "80+")) +
# Set the colour legend
scale_colour_discrete(name = "Data category", labels = c("% cases", "% hospitalizations")) +
# 4. Data
# multi-factor violin plots but keep the width consistent
### 3.4.0 Link your fill to the colour aesthetic
geom_violin(scale="width",
aes(colour = stat_group, fill=...),
lwd = 1.5) +
# Boxplot but smaller width so they reside "within" the violin plot
geom_boxplot(aes(fill = stat_group), width=0.2,
position = position_dodge(width=0.9),
outlier.shape=NA) + # Remove the outliers
# Add in all of the data points
geom_quasirandom(dodge.width = 0.85, aes(group=stat_group), alpha = 0.8)
# Show the plot
demographics.plot
# Adjust our plot window size according to the expected output
options(repr.plot.width=20, repr.plot.height=12)
covid_demographics_total.df %>%
# Ungroup this dataframe to clean it up a little
ungroup() %>%
# Filter for cumulative data
filter(period == "cumulative") %>%
# Select for just the important columns
select(public_health_unit, age_group, percent_cases, percent_hospitalizations) %>%
# Pivot the modified table to capture the "stat_group" of percent_cases vs percent_hospitalizations
pivot_longer(cols=c(3,4), names_to = "stat_group", values_to = "percent_PHU_total") %>%
# Plot the data as a grouped violin plot with inset boxplot
# 1. Data
ggplot(.) +
# 2. Aesthetics
aes(x=age_group, y = percent_PHU_total) +
# Start with a base theme
theme_minimal() +
theme(text = element_text(size=20), # set text size to 20
# Move the legend around to within the panel space
legend.justification = c(0,1),
legend.position = c(0.02,0.95),
legend.direction = "horizontal",
# Update the panel to drop the minor axis grid lines
panel.grid.minor = element_blank(),
# Use a black line for the axes
axis.line = element_line(colour = "black"),
axis.text = element_text(colour = "black", face="bold"),
) +
# Add labels to the plot
labs(title = "Percent cases and hospitalizations by proportion per PHU across age group",
x = "\nAge group",
y = "Proportion of reported PHU data\n",
caption = "\n*Age group values are calculated as a percentage of total cases or hospitalizations within a PHU") +
### Use the guides() layer and get rid of the scale_fill_discrete() layer
guides(fill = "none", group = "none") +
# 3. Scaling
scale_y_continuous(limits = c(0, 0.5)) + # Set the limits of our y-axis
# Set the labels of our x-axis categories
scale_x_discrete(labels=c("0-4", "5-11", "12-19", "20-39", "40-59", "60-79", "80+")) +
# Set the colour legend
### 3.0.0 Comprehension question
scale_colour_manual(name = "Data category", labels = c("% cases", "% hospitalizations"),
values = c(...)) +
scale_fill_manual(name = "Data category", labels = c("% cases", "% hospitalizations"),
values = c(...)) +
# 4. Data
# multi-factor violin plots but keep the width consistent
### Link your fill to the colour aesthetic
geom_violin(scale="width",
aes(colour = stat_group, fill=after_scale(alpha(colour, 0.3))),
lwd = 1.5) +
# Boxplot but smaller width so they reside "within" the violin plot
geom_boxplot(aes(fill = stat_group), width=0.2,
position = position_dodge(width=0.9),
outlier.shape=NA) + # Remove the outliers
# Add in all of the data points
geom_quasirandom(dodge.width = 0.85, aes(group=stat_group), alpha = 0.8)
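The two delayed-evaluation helpers can be seen side by side in a small sketch on made-up data: after_scale() derives a semi-transparent fill from the scaled outline colour, while after_stat() labels each bar with the count that only exists once the stat has run. `toy.df` and `toy.plot` are hypothetical names.

```r
library(ggplot2)

# Hypothetical toy data: group "a" appears 3 times, "b" twice, "c" once
toy.df <- data.frame(grp = c("a", "a", "a", "b", "b", "c"))

toy.plot <- ggplot(toy.df) +
  aes(x = grp) +
  # Outline colour comes from the data; fill is derived from the scaled colour
  geom_bar(aes(colour = grp, fill = after_scale(alpha(colour, 0.3))),
           linewidth = 1.5) +
  # The count is computed by the stat, so it is accessed with after_stat()
  geom_text(aes(label = after_stat(count)), stat = "count", vjust = -0.5)

toy.plot
```

Changing the colour scale afterwards, for instance with a manual scale, should carry through to the fill automatically because the fill is computed from the final colour values.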
![]() |
|---|
| It's all about figuring out how to add those finishing touches |
After preparing your visualization you may consider adding extra annotations. These are usually layers that don't affect the aesthetics or data of your visualization, although depending on how you add them and the package you are using, this isn't strictly true. For the most part, however, let's consider your annotations as separate from your plot.
We've already dabbled in annotations since the first lecture but now we're going to look deeply at how these work and some more advanced annotation packages.
annotate() plots with shapes, text, and arrows.
Sometimes you need to add some additional text or shapes to your graph that aren't necessarily a part of the data itself. In other words, you would like to annotate your plot. To accomplish this you can use the annotate() function, which will essentially add geoms to your plot. While these annotations can affect the axis limits of your plot if required to show your annotation(s), they won't affect the legends nor be treated as actual data - just an overlay to your plot.
The annotate() geom has the following parameters:
| Parameter | Description |
|---|---|
| geom | Can be any number of possible values including "text", "rect", "segment", "curve", etc. |
| x, y, xmin, xmax, ymin, ymax, xend, yend | Positioning aesthetics where at least one of these must be defined. |
| ... | Other aesthetics arguments that can be passed along like color = "red" |
| na.rm | If FALSE (the default), missing values are removed with a warning. If TRUE, they are silently removed. |
Up to this point we've already added some annotations to this plot in previous lectures. Today we'll update a few bits of text and add line segments with arrows instead of boxes.
When naming your geom parameter, you can essentially use whatever geom_*() are available within ggplot. For instance, we'll annotate using a geom_curve() by setting geom = "curve". Some of the geom_curve() parameters include:
- x, xend, y, yend: the start and end coordinates of your curve.
- lineend: the line end style (round, butt, square).
- curvature: a numeric value giving the amount of curvature joining start to end. Negative and positive values bend the curve in opposite directions and 0 produces a straight line.
- angle: an amount (0 to 180) to skew the control points of the curve.
# Update our phu_window.plot with some annotations and save it to a new object
phu_window_annotate.plot <-
phu_window.plot +
# 2. Aesthetics
# Move our legend to the right side of the panel
theme(legend.justification = c(1,1), legend.position = c(0.98, 0.98)) +
# 3. Scaling
# Stretch out the x-axis scale a bit to fit our labels
scale_x_date(limits = c(as.Date("2020-12-01"), # Set a start date for our limit
as.Date("2023-03-01")), # Identify the last date and use that
date_breaks = "1 month", # How will we break up the dates?
date_labels = "%b-%Y") +
# Winter 2020 lockdown
geom_text(aes(x=as.Date("2020-12-26") + 7, label = "Province-wide lockdown", y=2400),
angle=90, size=10, colour="black") +
annotate("rect", xmin=as.Date("2020-12-26"), xmax=as.Date("2020-12-26") + 14,
ymin=-Inf, ymax=Inf, fill="red", alpha=0.2) +
# Spring 2021 Lockdown
geom_text(aes(x=as.Date("2021-04-03") + 7, label = "Province-wide lockdown", y=2400),
angle=90, size=10, colour="black") +
annotate("rect", xmin=as.Date("2021-04-03"), xmax=as.Date("2021-04-03") + 14,
ymin=-Inf, ymax=Inf, fill="red", alpha=0.2) +
# Omicron arrives
### 3.1.0 Annotate using a curve
geom_text(x=as.Date("2021-09-25"), label = "First Omicron\ncases reported\nin Ontario", y=1000,
hjust=1, vjust = 0, size=10, colour="black") +
annotate("curve", # Make a curve
x=as.Date("2021-10-01"), xend = as.Date("2021-11-28"), # Set the x-coordinates
y=..., yend=..., # Set the y-coordinates
lineend = "round", curvature = ..., # Set the line characteristics
colour="red", linewidth = 1, arrow = ...) + # Add an arrow at the end
# Ontario ends proper PCR testing
geom_text(aes(x=as.Date("2022-02-10"), label = "Ontario reduces public\nPCR COVID-19 testing", y=2500),
hjust=0, size=10, colour="black") +
annotate("segment", x=as.Date("2022-02-01"), xend = as.Date("2021-12-31"),
y=2500, yend=2500, colour="red", linewidth = 1, arrow = arrow())
# display our plot
phu_window_annotate.plot
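The same annotation types can be tried out on a much smaller plot. The following is a minimal sketch on made-up data, where `toy.df`, `toy.plot` and all coordinates are hypothetical; it uses annotate("text") so the label is drawn once rather than once per data row.

```r
library(ggplot2)

# Hypothetical toy data: a short case-count time series
toy.df <- data.frame(day = 1:10,
                     cases = c(2, 3, 5, 4, 6, 9, 15, 30, 28, 20))

toy.plot <- ggplot(toy.df) +
  aes(x = day, y = cases) +
  geom_line(linewidth = 1) +
  # A single text label, independent of the number of data rows
  annotate("text", x = 4, y = 25, label = "Cases begin\nto rise",
           hjust = 1, size = 5) +
  # A curved arrow from the label towards the point of interest
  annotate("curve", x = 4.2, xend = 6.8, y = 25, yend = 15.5,
           curvature = -0.3, lineend = "round",
           colour = "red", linewidth = 1,
           arrow = arrow(length = unit(0.03, "npc"))) +
  # A shaded band spanning the full height of the panel
  annotate("rect", xmin = 8, xmax = 9, ymin = -Inf, ymax = Inf,
           fill = "red", alpha = 0.2)

toy.plot
```

Flipping the sign of curvature bends the arrow the other way, which is a quick way to keep it from crossing your data.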
Unlike the annotations we just discussed, you may wish to directly label or output information based on your data from the plot. This can be in the form of error bars, or data labels. Sometimes you may want to include your sample size or further highlight your outliers.
directlabels package
If you need to label your plot data directly, the geom_dl() layer from the directlabels package can be quite useful. The package is intended to replace colour legends with direct labels, since this can (sometimes) be a little cleaner and less confusing. Parameters you should set when working with geom_dl() are:
- method: this is the positioning method for the direct label placement and MUST be specified.
  - Use list() to update additional attributes like fontsize (cex), fontfamily, rotation (rot) etc.
- aes(): like any geom, you can specify aesthetics information including the labels and colour.
Note that adding direct labels this way, however, will not remove the corresponding legend from the plot. It will simply add extra geoms to your plot.
phu_window_annotate.plot +
# Update the labeling of our lines
geom_dl(method=..., aes(label=...))
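For a self-contained illustration, here is a minimal sketch of geom_dl() on made-up data, assuming the directlabels package is installed. `toy.df` and `toy.plot` are hypothetical names, and "last.points" is just one of the package's positioning methods.

```r
library(ggplot2)
library(directlabels)

# Hypothetical toy data with three series
toy.df <- data.frame(
  x = rep(1:4, times = 3),
  y = c(1:4, (1:4) * 2, (1:4) * 3),
  grp = rep(c("a", "b", "c"), each = 4)
)

toy.plot <- ggplot(toy.df) +
  aes(x = x, y = y, colour = grp) +
  geom_line(linewidth = 1) +
  # Leave some room on the right-hand side for the labels
  scale_x_continuous(expand = expansion(mult = c(0.05, 0.2))) +
  # Label each line at its last point; attributes come before the method
  geom_dl(aes(label = grp), method = list(cex = 1.2, "last.points")) +
  # geom_dl() does not drop the legend, so remove it explicitly
  guides(colour = "none")

toy.plot
```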
direct.label() feature
For simplicity, you can also call on direct.label(), which will automatically remove the associated legend from your plot. You can use it by providing the following parameters:
- p: the ggplot object you've already created.
- method: the positioning method, as with geom_dl().
  - Use dl.combine() to include several positioning methods at the same time.
  - Use list() to update additional attributes like fontsize (cex), fontfamily, rotation (rot) etc. To do this, you must also include your method in the list, after your attribute changes.
# Use direct.label() to reformat your plot
...(p = phu_window_annotate.plot, # Provide a plot object
list(cex=2, "last.bumpup")) # Detail the format information for your labeling
gghighlight()
You may find yourself in an instance where you have too many data groups to present (i.e. 34 PHUs) but would still like the audience to get an overview of your dataset while focusing on a few items. As we have done in the past, you could break groups out using facet_*() but that isn't always ideal. We have also filtered for the top PHUs from a previously generated list, but then we get no sense of the other PHUs at all.
Instead you can use the gghighlight() layer from the package of the same name. Some helpful parameters from this layer include:
- ...: the expressions you will use to filter data (i.e. your predicate), which will be passed to dplyr::filter().
- max_highlight: the maximum number of series to highlight.
- unhighlighted_params: the aesthetics for your unhighlighted groups.
- use_group_by: if TRUE, this function will use dplyr::group_by() to evaluate your predicate.
- use_direct_label: if TRUE, labels will be added directly to the plot instead of using a legend.
- label_key: the column name for label aesthetics.
- label_params: a list of aesthetics customizations like size.
Let's plot all of our PHU data onto the graph and only highlight the top 4 PHUs as before. We'll have to do some extra fiddling to make it work just right.
# This is going to be a simpler graph so adjust our plot window size accordingly
options(repr.plot.width=20, repr.plot.height=10)
ggh <-
# Build our plot and save to an object
covid_phu_window.df %>%
# Reorder the PHU factor levels by descending window mean (no filtering happens here)
mutate(public_health_unit = fct_reorder(public_health_unit, window_mean, .desc=TRUE))
phu_cases.plot <-
# redirect the filtered result to ggplot
# 1. Data
ggplot(ggh) +
# 2. Aesthetics
aes(x = start_date, y = window_mean, colour = public_health_unit) +
# Start with a base theme
theme_minimal() +
theme(text = element_text(size=20), # set text size to 20
# Move the legend around to within the panel space
legend.justification = c(1,1),
legend.position = c(0.98,0.98),
# Update the panel to drop the minor axis grid lines
panel.grid.minor = element_blank(),
# Use a black line for the axes
axis.line = element_line(colour = "black"),
axis.text = element_text(colour = "black", face="bold"),
# Adjust the x-axis text
axis.text.x = element_text(angle = 90, # Rotate 90
hjust = 1, # Right-justify
vjust = 0.5) # Centre text "vertically" on axis tick
) +
# Add labels to our plot
labs(title = "Mean cases of COVID-19 in a 14-day window across top 5 Ontario Public Health Units\n",
x = "\nWindow date",
y = "Mean cases in 14-day window\n",
colour = "Public Health Unit",
caption = "*14-day rolling mean with date as start of the window") +
# 3. Scaling
# Start looking at data from December 2020 onwards
scale_x_date(limits = c(as.Date("2020-12-01"), # Set a start date for our limit
as.Date("2023-03-01")), # Set the end date in your limit
date_breaks = "1 month", # How will we break up the dates?
date_labels = "%b-%Y") + # How will we format labels
# Change our y-axis breaks
scale_y_continuous(limits = c(-10, 3500), breaks = seq(0, 3500, 500)) +
### -------------------- Section 4.3.0 highlighting specific geoms -------------------- ###
# 4. Geoms
### Plot all of our public health unit data
...(linewidth=1, aes(x=start_date, y=window_mean, group = public_health_unit, colour = public_health_unit)) +
### Highlight just the top 4 PHUs
...(public_health_unit %in% phu_by_total_cases_desc[1:4], # Filter your data
use_group_by = FALSE, # Don't group it
label_params = list(size = 10)) + # Set the labels to size 10
### -------------------- Section 4.3.0 highlighting specific geoms -------------------- ###
# 8. Annotations
# Winter 2020 lockdown
geom_text(aes(x=as.Date("2020-12-26") + 7, label = "Province-wide lockdown", y=2400),
angle=90, size=10, colour="black") +
annotate("rect", xmin=as.Date("2020-12-26"), xmax=as.Date("2020-12-26") + 14,
ymin=-Inf, ymax=Inf, fill="red", alpha=0.2) +
# Spring 2021 Lockdown
geom_text(aes(x=as.Date("2021-04-03") + 7, label = "Province-wide lockdown", y=2400),
angle=90, size=10, colour="black") +
annotate("rect", xmin=as.Date("2021-04-03"), xmax=as.Date("2021-04-03") + 14,
ymin=-Inf, ymax=Inf, fill="red", alpha=0.2) +
# Omicron arrives
# Annotate using a curve
geom_text(x=as.Date("2021-09-25"), label = "First Omicron\ncases reported\nin Ontario", y=1000,
hjust=1, vjust = 0, size=10, colour="black") +
annotate("curve", # Make a curve
x=as.Date("2021-10-01"), xend = as.Date("2021-11-28"), # Set the x-coordinates
y=1000, yend=100, # Set the y-coordinates
lineend = "round", curvature = -0.5, # Set the line characteristics
colour="red", linewidth = 1, arrow = arrow()) + # Add an arrow at the end
# Ontario ends proper PCR testing
geom_text(aes(x=as.Date("2022-02-10"), label = "Ontario reduces public\nPCR COVID-19 testing", y=2500),
hjust=0, size=10, colour="black") +
annotate("segment", x=as.Date("2022-02-01"), xend = as.Date("2021-12-31"),
y=2500, yend=2500, colour="red", linewidth = 1, arrow = arrow())
# plot our data
phu_cases.plot
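The core idea is easier to see on a small example. Here is a minimal sketch of gghighlight() on made-up data, assuming the gghighlight package is installed; `toy.df`, `toy.plot` and the threshold of 30 are hypothetical.

```r
library(ggplot2)
library(gghighlight)

# Hypothetical toy data: four units tracked over five days
toy.df <- data.frame(
  day = rep(1:5, times = 4),
  unit = rep(c("A", "B", "C", "D"), each = 5),
  cases = c(1, 2, 3, 4, 5,
            2, 4, 8, 16, 32,
            3, 3, 3, 3, 3,
            5, 10, 20, 25, 40)
)

toy.plot <- ggplot(toy.df) +
  aes(x = day, y = cases, colour = unit) +
  geom_line(linewidth = 1) +
  # With the default use_group_by = TRUE the predicate is evaluated per group,
  # so only series whose maximum exceeds 30 keep their colour
  gghighlight(max(cases) > 30,
              unhighlighted_params = list(colour = "grey80"),
              label_params = list(size = 5))

toy.plot
```

In this toy data only units B and D pass the predicate; the remaining series stay visible in grey for context.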
geom_*()
When working with bar or line plots where you may have generated information such as a mean with standard deviation, you can plot that information with geom_errorbar(). Unlike the annotations from above, this is a specific geom and is treated by the plot like any other geom_*() we've encountered. Under its aes() argument you can specify the ymin and ymax values or data sources. If you already have pre-generated columns for these values, you can use them directly, or you can calculate them on the fly if you have just a mean and standard deviation.
There are alternative formats of the geom_errorbar() as well:
| geom | Description |
|---|---|
| geom_crossbar() | A hollow box with the middle indicated by a horizontal line. |
| geom_errorbarh() | Horizontal versions of the errorbar. |
| geom_linerange() | Draws an interval using a single vertical line. |
| geom_pointrange() | Same as a linerange except an additional point is plotted in the middle of the range. |
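To make the difference between these geoms concrete, here is a minimal sketch on made-up summary data, where the interval is computed on the fly as the mean plus or minus one standard deviation. The names `toy.df`, `base.plot`, `errorbar.plot` and `pointrange.plot` are hypothetical.

```r
library(ggplot2)

# Hypothetical summary data: one mean and standard deviation per group
toy.df <- data.frame(
  grp  = c("a", "b", "c"),
  mean = c(0.10, 0.25, 0.40),
  sd   = c(0.02, 0.05, 0.08)
)

base.plot <- ggplot(toy.df) +
  aes(x = grp, y = mean)

# Error bars computed on the fly as mean +/- one standard deviation
errorbar.plot <- base.plot +
  geom_errorbar(aes(ymin = mean - sd, ymax = mean + sd), width = 0.2) +
  geom_point(size = 3)

# The same interval drawn as a point with a vertical line through it
pointrange.plot <- base.plot +
  geom_pointrange(aes(ymin = mean - sd, ymax = mean + sd))

errorbar.plot
pointrange.plot
```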
Let's recreate one of our plots from lecture 2 using summary data and some of these new geoms!
# Adjust our plot window size according to the expected output
options(repr.plot.width=20, repr.plot.height=10)
covid_demographics.plot <-
covid_demographics_total.df %>%
# Ungroup this dataframe to clean it up a little
ungroup() %>%
# Filter for cumulative data only
filter(period == "cumulative") %>%
# Select for just the important columns
select(public_health_unit, age_group, percent_cases, percent_hospitalizations) %>%
# Pivot the modified table to capture the "stat_group" of percent_cases vs percent_hospitalizations
pivot_longer(cols=c(3,4), names_to = "stat_group", values_to = "percent_PHU_total") %>%
# Group the data both by age group and then stat group
group_by(age_group, stat_group) %>%
# Generate some summary statistics
summarise(mean = mean(percent_PHU_total),
sd = sd(percent_PHU_total),
median = median(percent_PHU_total)) %>%
# Plot the data as a mixture of multiple geoms
# 1. Data
ggplot(.) +
# 2. Aesthetics
aes(x=age_group, y = mean, linetype=stat_group) +
# Themes
theme_minimal() +
theme(text = element_text(size=20), # set text size to 20
# Update the panel to drop the minor axis grid lines
panel.grid.minor = element_blank(),
# Use a black line for the axes
axis.line = element_line(colour = "black"),
axis.text = element_text(colour = "black", face="bold"),
) +
# Add labels to the plot
labs(title = "Percent cases and hospitalizations by proportion per PHU across age group",
x = "\nAge group",
y = "Proportion of reported PHU data\n",
caption = "\n*Age group values are calculated as a percentage of total cases or hospitalizations within a PHU") +
# 4. Data
### 5.1.0 Add an errorbar to represent the standard deviation range
...(width = 0.2, aes(y = mean, ymin = ..., ymax = ..., colour=stat_group), size=1) +
### 5.1.0 Add a point to represent the mean of each error bar
geom_point(aes(y=mean, shape=age_group, group = stat_group), size = 5)
covid_demographics.plot
covid_demographics.plot +
# Add a line to connect our age groups
geom_line(aes(x=age_group, y=..., group = stat_group, colour=stat_group), linewidth=1)
Now that we've gone and built ourselves an extremely strange plot, (remember, this is just an example) there are a few things we can fix/play with.
age_group is on top.
guide parameter or guides() layer
We've already looked at some helpful legend alterations pertaining to positioning and text relabeling in section 2.0.0. Now we'll explore some of the remaining tips and tricks when it comes to working with multiple legends within your plot.
Recall that within each of the guide types, you can update parameters about text within the legend.
| Component | Sub-components |
|---|---|
| title | name, position, theme, hjust, vjust |
| label | name, position, theme, hjust, vjust |
| key | width, height |
| order | you can determine the order of the guide amongst others using integers [1:99]. 0 sets order by an algorithm |
| other | direction of guide, number of rows/cols |
We'll take a closer look at the order parameter next using our above visualization of the age-grouped data.
covid_demographics.plot +
# 2. Aesthetics
### 5.2.0 Set our guide positions for linetype and colour to 2
guides(linetype = "none",
colour = guide_legend(title="Indicator", order=...)) +
# 3. Scaling
### 5.2.0 rename our x-axis labels
scale_x_discrete(labels=covid_demographics_total.df$age_group %>% levels() %>% as.character() %>%
str_replace_all(pattern=" to ", replacement = "-") # Use string replacement to change our labels
) +
### 5.2.0 ggplot only adds 6 shapes automatically so we need to add more manually
scale_shape_manual(values=c(1:nlevels(covid_demographics_total.df$age_group)),
guide=guide_legend(title = "Age group", order=...)) +
# Set the colour legend
scale_colour_discrete(name = "Indicator", labels = c("% cases", "% hospitalizations")) +
# 4. Geoms
# Add a line to connect our age groups
geom_line(aes(x=age_group, y=mean, group = stat_group, colour=stat_group), linewidth=1)
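To see the `order` parameter in isolation, here is a minimal sketch that uses the built-in `mtcars` data rather than our PHU data. The object name `legend_order.plot` and the choice of columns are assumptions made purely for illustration. The shape legend is requested first and the colour legend second.

```r
library(ggplot2)

# Toy example on the built-in mtcars data (not the lecture dataset)
legend_order.plot <-
  ggplot(mtcars) +
  aes(x = wt, y = mpg, colour = factor(cyl), shape = factor(gear)) +
  geom_point(size = 3) +
  # order = 1 places the shape legend ahead of the colour legend (order = 2)
  guides(shape  = guide_legend(title = "Gears", order = 1),
         colour = guide_legend(title = "Cylinders", order = 2))

# Show the plot
legend_order.plot
```

Swapping the two `order` values should flip which legend is drawn first.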
`override.aes`
Before we leave the `guides()` section, we should update our plot one last time. When you are working with so many shapes, they can sometimes show up a little smaller than you want. You may wish to increase their size on the plot but that may disproportionately increase their size on the legend. If you think about the legend similarly to a plot itself, then you can grasp how the `override.aes` parameter might work.
To adjust some of the aesthetic elements of your plot legend, provide a named list to the `override.aes` parameter. You can use aes parameters like `size` and `colour` to adjust how your legends display information rather than drawing their parameters from the plot itself. We'll be applying this parameter within our guides.
At the same time, we'll update our points to be larger and bolder/thicker by altering their `stroke` parameter.
covid_demographics.plot +
# 2. Aesthetics
# Set our guide positions for linetype and colour to 2
guides(linetype = "none",
colour = guide_legend(title="Indicator", order=2)) +
# 3. Scaling
# rename our x-axis labels
scale_x_discrete(labels=covid_demographics_total.df$age_group %>% levels() %>% as.character() %>%
str_replace_all(pattern=" to ", replacement = "-") # Use string replacement to change our labels
) +
# ggplot only adds 6 shapes automatically so we need to add more manually
### 5.2.1 Override the size of the shapes in our legend
scale_shape_manual(values=c(1:nlevels(covid_demographics_total.df$age_group)), # Set values based on number of levels
guide=guide_legend(title = "Age group", order=1,
# Increase the shape size and line thickness
... = list(...))) +
# Set the colour legend
scale_colour_discrete(name = "Indicator", labels = c("% cases", "% hospitalizations")) +
# 4. Geoms
# Add a line to connect our age groups
geom_line(aes(x=age_group, y=mean, group = stat_group, colour=stat_group), linewidth=1) +
### 5.2.1 Update the points to be larger and thicker (Note that the previous geom_point layer still exists!)
geom_point(aes(y=mean, group = stat_group, shape=age_group), size = 6, stroke = 1.5)
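As a small stand-alone illustration of `override.aes`, the sketch below again uses `mtcars` (an assumption for illustration; the object name `legend_override.plot` is hypothetical). The points on the panel stay small while the legend keys are drawn larger and with a thicker outline.

```r
library(ggplot2)

# Toy example on the built-in mtcars data (not the lecture dataset)
legend_override.plot <-
  ggplot(mtcars) +
  aes(x = wt, y = mpg, shape = factor(gear)) +
  # Points are drawn small on the panel...
  geom_point(size = 2) +
  # ...but the legend keys are drawn larger with thicker outlines
  scale_shape_manual(values = c(1, 2, 5),
                     guide = guide_legend(title = "Gears",
                                          override.aes = list(size = 6, stroke = 1.5)))

# Show the plot
legend_override.plot
```

Because the override lives in the guide, the layer itself is untouched: only the legend keys change.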
`ggforce` package annotates with simple `geom_mark_*()` options
The `ggforce` package brings helpful geoms and functions to `ggplot2` that can quickly annotate groups of data within your plots. These layers work with `ggplot2` like other `geom_*()` layers so you can add them into your plots quite simply. These objects can also accept aesthetic mappings (including the ability to filter groups) amongst many other theme-esque parameters and are added in an automated fashion. More information can be found here
| geom | Description |
|---|---|
| geom_mark_circle() | Add circles to all of your data groups |
| geom_mark_rect() | Add rounded-corner rectangles to your data groups |
| geom_mark_ellipse() | Add ellipses to all of your data groups |
| geom_mark_hull() | Add a more tightly-fitted shape/blob (aka hull) around your data groups |
You can also add custom shapes, specifying their type, location, etc. In addition, extensions to the `facet_*()` group of layers allow you to facet by different columns, zoom in on part of a graph as a facet, and split facets into multiple plots.
Let's add some ellipses to our plot and exchange our geom_line() for a smoother geom_bspline(). More about the geom_bspline() parameters can be found here
# Adjust our plot window size according to the expected output
options(repr.plot.width=20, repr.plot.height=10)
demographics_summary.plot <-
covid_demographics_total.df %>%
# Ungroup this dataframe to clean it up a little
ungroup() %>%
# Filter for cumulative data only
filter(period == "cumulative") %>%
# Select for just the important columns
select(public_health_unit, age_group, percent_cases, percent_hospitalizations) %>%
# Pivot the modified table to capture the "stat_group" of percent_cases vs percent_hospitalizations
pivot_longer(cols=c(3,4), names_to = "stat_group", values_to = "percent_PHU_total") %>%
# Group the data both by age group and then stat group
group_by(age_group, stat_group) %>%
# Generate some summary statistics
summarise(mean = mean(percent_PHU_total),
sd = sd(percent_PHU_total),
median = median(percent_PHU_total)) %>%
# Plot the data as a mixture of multiple geoms
# 1. Data
ggplot(.) +
# 2. Aesthetics
aes(x=age_group, y = mean, linetype=stat_group) +
# Themes
theme_minimal() +
theme(text = element_text(size=20), # set text size to 20
# Update the panel to drop the minor axis grid lines
panel.grid.minor = element_blank(),
# Use a black line for the axes
axis.line = element_line(colour = "black"),
axis.text = element_text(colour = "black", face="bold")
) +
# Add labels to the plot
labs(title = "Percent cases and hospitalizations by proportion per PHU across age group",
x = "\nAge group",
y = "Proportion of reported PHU data\n",
caption = "\n*Age group values are calculated as a percentage of total cases or hospitalizations within a PHU") +
# Set our guide positions for linetype and colour to 2
guides(linetype = "none",
colour = guide_legend(title="Indicator", order=2)) +
# 3. Scaling
# rename our x-axis labels
scale_x_discrete(labels=covid_demographics_total.df$age_group %>% levels() %>% as.character() %>%
str_replace_all(pattern=" to ", replacement = "-") # Use string replacement to change our labels
) +
# ggplot only adds 6 shapes automatically so we need to add more manually
# Override the size of the shapes in our legend
scale_shape_manual(values=c(1:nlevels(covid_demographics_total.df$age_group)), # Set values based on number of levels
guide=guide_legend(title = "Age group", order=1,
override.aes = list(size=7, stroke = 0.8))) +
# Set the colour legend
scale_colour_discrete(name = "Indicator", labels = c("% cases", "% hospitalizations")) +
# 4. Geoms
# Add an errorbar to represent the standard deviation range
geom_errorbar(width = 0.2, aes(y = mean, ymin = mean-sd, ymax = mean+sd, colour=stat_group), linewidth=1) +
# Add a line to connect our age groups
### 5.3.0 replace our line with a B-spline curve that is a little smoother and follows most of the points
...(aes(group = stat_group, colour=stat_group), size=1) +
### 5.3.0 Add ellipses to 2 specific age groups to highlight what we care about
...(aes(group = age_group, filter = ... %in% c("20 to 39", "80+"), label=age_group),
fill="blue", alpha=0.2) +
# Update the points to be larger and thicker
geom_point(aes(y=mean, group = stat_group, shape=age_group), size = 6, stroke = 1.5)
# Show the plot
demographics_summary.plot
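The `filter` aesthetic used by the `geom_mark_*()` layers can be tried out on a smaller example. The sketch below assumes the built-in `iris` data and a hypothetical object name `marked.plot`. Only the species that pass the filter receive a labelled ellipse.

```r
library(ggplot2)
library(ggforce)

# Toy example on the built-in iris data (not the lecture dataset)
marked.plot <-
  ggplot(iris) +
  aes(x = Petal.Length, y = Petal.Width) +
  # Draw an ellipse only around the groups that pass the filter
  geom_mark_ellipse(aes(group = Species,
                        filter = Species %in% c("setosa", "virginica"),
                        label = Species),
                    fill = "blue", alpha = 0.2) +
  # Draw the points on top of the ellipses
  geom_point()

# Show the plot
marked.plot
```

The `versicolor` points are still plotted; they simply do not get an annotation.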
`geom_*()`
On a side note to annotation, sometimes you want to add a little more information to your plot. In our case above, we have the summary data from our plot, but wouldn't it be nice to add some of the actual data points to the visualization?
While it may not be the best choice for this particular plot, it's still something we can do to demonstrate the importance of layering in our figures. While we haven't explicitly discussed this, it should be clear that, by default, each `geom_*()` draws its data from the initial dataframe provided to the `ggplot()` call.
Much like mapping individual aesthetics, we can also assign each individual geom_*() its own dataset! Recall that last lecture we introduced the ggbeeswarm package. Let's add some datapoints to our last plot by including a geom_quasirandom() layer. In order to include this data, we need actual data points so we'll generate an intermediate dataframe called covid_demo_long.df.
# Build a long-format dataframe to supply later to our plot
covid_demo_long.df <-
covid_demographics_total.df %>%
# Ungroup this dataframe to clean it up a little
ungroup() %>%
# Filter for cumulative data
filter(period == "cumulative") %>%
# Select for just the important columns
select(public_health_unit, age_group, percent_cases, percent_hospitalizations) %>%
# Pivot the modified table to capture the "stat_group" of percent_cases vs percent_hospitalizations
pivot_longer(cols=c(3,4), names_to = "stat_group", values_to = "percent_PHU_total")
demographics_summary.plot +
### 5.4.0 Add our points using beeswarm from a DIFFERENT data set
geom_quasirandom(data = ..., aes(x=age_group, y = percent_PHU_total, group = stat_group),
varwidth=TRUE, method="quasirandom", alpha = 0.5)
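The idea of giving a single layer its own dataset can be sketched without our PHU data. In this example (an illustration only, with hypothetical names `cyl_means.df` and `layered.plot`) the plot is built from a summary table, and a second layer receives the raw observations through its `data` argument. Note that `geom_jitter()` from `ggplot2` is swapped in for `geom_quasirandom()` here so that the sketch does not depend on `ggbeeswarm`.

```r
library(ggplot2)

# Summary table: one mean mpg value per cylinder group
cyl_means.df <- aggregate(mpg ~ cyl, data = mtcars, FUN = mean)

layered.plot <-
  ggplot(cyl_means.df) +
  aes(x = factor(cyl), y = mpg) +
  # This layer inherits the summary data given to ggplot()
  geom_point(size = 6, shape = 1, stroke = 1.5) +
  # This layer is given its own dataset: the raw observations
  geom_jitter(data = mtcars, aes(x = factor(cyl), y = mpg),
              width = 0.1, height = 0, alpha = 0.5)

# Show the plot
layered.plot
```

The column names used in the layer's `aes()` must exist in the dataset handed to that layer, which is why both tables here share `cyl` and `mpg`.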
Working in biological science, you will often find yourself wanting to italicize species names or add special characters when naming proteins etc. This is not a feat easily accomplished using the options provided by `ggplot2`. Instead, you can generate objects with the required font changes or symbols and then provide these objects to your plot. In addition to these special text objects, you could also explore packages that add this kind of functionality more organically to your plots.
`expression()` function to generate an expression object
There are a few routes to accomplish this kind of formatting. We'll explore the first, `expression()`, which makes an expression object. The `expression()` function interprets a series of strings and characters into a mathematically-formatted expression. When supplied as an argument, this object is interpreted as a mathematical expression and the output is formatted based on a TeX-like set of rules that parse through the syntax.
Within this function, there are a number of keywords that can look like functions but are interpreted by the plotting system (see `?plotmath`) rather than evaluated as base R functions, so don't expect the same kind of behaviours. Here is a non-exhaustive list of potential situations you may encounter.
| Symbol | Description |
|---|---|
| +, -, %*%, %/%, %+-% | basic mathematical symbols for +, -, $\times$, $\div$, and $\pm$ |
| paste(x,y,z), x*y*z | juxtapose x, y, and z without any separators |
| sqrt(x) | square root of x |
| sqrt(x, y) | the yth root of x |
| plain(x), bold(x), italic(x), bolditalic(x), symbol(x), underline(x) | draw x in normal, bold, italic, bolditalic, symbol and underlined font |
| list(x, y, z) | output a comma-separated list of x, y, z |
| hat(x), tilde(x), dot(x), bar(x) | add symbols above x |
| alpha to omega, Alpha to Omega | Greek symbols in lower and upper case |
| infinity | the infinity symbol |
| x ~ y, x ~~ y | put a space between x and y or put extra space between them |
| phantom(0) | leave a gap for "0" without drawing it |
| frac(x, y), over(x, y) | output x over y |
| atop(x, y) | output x over y without any bar |
Note from above, to build your expressions from multiple parts, you should use the `*` operator or `paste()` from within `expression()`.
demographics_summary.plot +
### 5.5.1 alter title labels using the expression() function
labs(title = ...,
x = "\nAge group",
y = "Proportion of reported PHU data\n",
colour = "Public Health Unit",
caption = ...
) +
# Add our points using beeswarm from a DIFFERENT data set
geom_quasirandom(data = covid_demo_long.df, aes(x=age_group, y = percent_PHU_total, group = stat_group),
varwidth=TRUE, method="quasirandom", alpha = 0.5)
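Because `expression()` is part of base R, you can experiment with it outside of a plot. The sketch below builds two expression objects (the names `title.expr` and `caption.expr` and their wording are assumptions for illustration only) that could later be supplied to a label such as `labs(title = ...)`.

```r
# italic() and bold() are plotmath markup here, not ordinary R functions
# The ~ operator joins the pieces with a space between them
title.expr <- expression("Growth of" ~ italic("Escherichia coli") ~ "in" ~ bold("rich media"))

# %+-% draws a plus-minus sign; bar(x) draws x with a bar above it
caption.expr <- expression(bar(x) %+-% sigma)

# Nothing is evaluated yet: each object just stores the unevaluated call
class(title.expr)
```

Since nothing inside `expression()` is evaluated, a variable name placed in it would be printed literally rather than replaced by its value.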
`bquote()`
Unlike the `expression()` function, `bquote()` allows you to reference values stored in variables, so that you can insert them instead of explicitly typing the words you want. When thinking about using `bquote()` you can break your math notation into four forms of syntax. These sections or forms can be joined with the `~` symbol.
| Class of text | Syntax | Description |
|---|---|---|
| Strings | "my text" ~ | Words and non-mathematical text that you want to print as-is |
| Math Expressions | infinity, alpha, frac(x, y) | Unquoted and essentially the same kinds of symbols useable by ?plotmath and expression(). |
| Numbers | 1, 42, 900000 | Use unquoted when part of math notation. |
| Variables | .(variable) | Used to pass in a string or numeric into your equation. Note the period at the front! |
Many R-enthusiasts prefer this form of generating expressions for its flexibility to build whatever you want.
# First figure out the minimum number of samples per group to generate a variable
sample.min <-
covid_demo_long.df %>%
# Group the data both by age group and then stat group
group_by(age_group, stat_group) %>%
# Generate the number of observations per group
summarise(count = n()) %>%
# Calculate the minimum sample number from our data
.$count %>% min()
# Now build the plot
demographics_summary.plot +
# alter title labels using the expression() function
labs(title = expression("Distribution of"~italic("new cases")~"vs"
~bold("hospitalizations")~"due to COVID-19 across Ontario PHUs"),
x = "\nAge group",
y = "Proportion of reported PHU data\n",
colour = "Public Health Unit",
### 5.5.2 alter our caption using the bquote() function
caption = ...
) +
# Add our points using beeswarm from a DIFFERENT data set
geom_quasirandom(data = covid_demo_long.df, aes(x=age_group, y = percent_PHU_total, group = stat_group),
varwidth=TRUE, method="quasirandom", alpha = 0.5)
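The substitution performed by `.()` can be checked in base R before the result ever reaches a plot. In this sketch the value `n.min` is hard-coded and the caption wording is an assumption for illustration only.

```r
# A value computed elsewhere (hard-coded here for illustration)
n.min <- 34

# .(n.min) is replaced by the value of n.min when bquote() is called
caption.call <- bquote("Mean" ~ "" %+-% "" ~ "SD with n" >= .(n.min))

# The stored call now contains the number itself rather than the variable name
caption.call
```

The resulting object can be supplied to a label in the same way as an expression, for example `labs(caption = caption.call)`.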
`ggtext` package to create simple markdown code
As an alternative method to produce simple formatting changes to your text, the `ggtext` package provides improved text rendering support for `ggplot2`. While this package only supports a limited set of Markdown/HTML/CSS syntax, it can handle simple things like bold and italic text, as well as super- and subscripting.
This package provides 2 new `theme()` elements:
- `element_markdown()`: renders text as markdown/HTML without word wrapping.
- `element_textbox()`: creates a markdown/HTML textbox with word wrapping.
Both of these elements are meant to effectively replace the `element_text()` that is native to `ggplot2`. Let's alter the x- and y-axis text a little bit to see how this works. Remember we'll have to replace both our labels and update the `theme()` elements we are interested in. More information on the `ggtext` package can be found here
# Now build the plot
demographics_summary.plot +
# alter title labels using the expression() function
labs(title = expression("Distribution of"~italic("new cases")~"vs"
~bold("hospitalizations")~"due to COVID-19 across Ontario PHUs"),
### 5.5.3 alter our axis titles using markdown/HTML syntax
... = "***Age*** group<sub>binned when retrieved</sub>",
... = "_Proportion_ __of__ <sup>reported <i>PHU</i> data</sup>",
# alter our caption using the bquote() function
caption = bquote("Error bars represent mean "~ ''%+-%'' ~"standard deviation with n"~">="~.(sample.min))) +
### 5.5.3 Convert the proper theme elements to markdown
theme(axis.title.x = element_markdown(),
axis.title.y = element_markdown()) +
# Add our points using beeswarm from a DIFFERENT data set
geom_quasirandom(data = covid_demo_long.df, aes(x=age_group, y = percent_PHU_total, group = stat_group),
varwidth=TRUE, method="quasirandom", alpha = 0.5)
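The two-step pattern (write the label in markdown/HTML, then switch the matching theme element to `element_markdown()`) can be sketched on a toy plot. The `mtcars` data, the label text, and the object name `markdown_axes.plot` are assumptions for illustration only.

```r
library(ggplot2)
library(ggtext)

# Toy example on the built-in mtcars data (not the lecture dataset)
markdown_axes.plot <-
  ggplot(mtcars) +
  aes(x = wt, y = mpg) +
  geom_point() +
  # Step 1: write the labels using markdown/HTML syntax
  labs(x = "***Weight*** <sub>1000 lbs</sub>",
       y = "_Fuel economy_ <sup>miles per gallon</sup>") +
  # Step 2: tell the theme to render those labels as markdown
  theme(axis.title.x = element_markdown(),
        axis.title.y = element_markdown())

# Show the plot
markdown_axes.plot
```

If step 2 is skipped, the asterisks and tags are printed literally, since the default `element_text()` does no markdown rendering.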
`ggExtra`
Marginal plots are a very specialized plot type from the `ggExtra` package which combines scatterplot data with distribution data in the margins. The main plot panel has your two variables along the x and y axis. Secondary plots are made on the opposite margins and can be in the form of distribution-based objects, i.e., histograms, boxplots, etc.
The workhorse of this package is the `ggMarginal()` function which takes as input parameters:
- `p`: the ggplot object you would like to add to
- `data`: optional as the information can be drawn from `p`, otherwise it can be a data.frame object of other data
- `x`: the variable name along the x-axis
- `y`: the variable name along the y-axis
- `type`: the type of marginal plot to show
- `margins`: along which margins to show the plots
- `xparams`, `yparams`: extra parameters to use only for the x or y marginal plots
- `groupColour`, `groupFill`: if TRUE, the colour or fill of the marginal plots will be mapped to the aesthetics of the scatterplot
Let's re-imagine our PHU age group data now as a scatterplot with marginal boxplots. While this won't be the clearest visualization of this kind of data it will help to demonstrate how to generate marginal plots with your data.
# Adjust our plot window size according to the expected output
options(repr.plot.width=12, repr.plot.height=12)
# Build our marginal plot from the wider-format data that we have
phu_age_scatter.plot <-
covid_demographics_total.df %>%
filter(age_group %in% c("20 to 39", "40 to 59", "60 to 79", "80+"),
period == "cumulative") %>%
# 1. Data
ggplot(.) +
# 2. Aesthetics
aes(x=..., y = ..., colour = age_group) +
# Themes
theme_grey() +
theme(text = element_text(size = 20), # set text size
legend.position = "bottom" # Move our legend to the bottom
) +
# Update the legend so that the legend keys are larger
guides(colour=guide_legend(override.aes= list(size=4))) +
# Update the labels
labs(x = "Percent cases",
y = "Percent hospitalizations",
colour = "Age group") +
# 3. Scaling
scale_colour_viridis_d(option = "viridis") +
# 4. Geoms
...(size = 4, alpha = 0.8) # Add our data points
# Add our marginal boxplots to our graph
# phu_marginal.plot <- ggMarginal(phu_age_scatter.plot, type="boxplot", groupFill=TRUE, margins="both", size=5)
# plot our marginal plot
# phu_marginal.plot
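For reference, the same `ggMarginal()` call pattern can be tried on a complete toy scatterplot. This sketch assumes the built-in `iris` data and hypothetical object names (`iris_scatter.plot`, `iris_marginal.plot`); it is not the solution to the skeleton above.

```r
library(ggplot2)
library(ggExtra)

# Build a scatterplot with colour mapped to a grouping variable
iris_scatter.plot <-
  ggplot(iris, aes(x = Sepal.Length, y = Sepal.Width, colour = Species)) +
  geom_point(size = 3, alpha = 0.8) +
  # A bottom legend leaves the right-hand margin free for the marginal plot
  theme(legend.position = "bottom")

# Add grouped boxplots to both margins, filled according to the colour mapping
iris_marginal.plot <- ggMarginal(iris_scatter.plot, type = "boxplot",
                                 margins = "both", groupFill = TRUE, size = 5)

# Show the plot
iris_marginal.plot
```

Note that `ggMarginal()` returns a new object rather than a regular ggplot, so further `+` layers should be added to the scatterplot before it is passed in.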
There are many fantastic R packages to analyze and visualize your data. As a group, we are likely working in a variety of specialized areas. The plots we have made so far today should be useful for data exploration for many different kinds of data. In this final section we are going to learn how to arrange multiple plots per page for those publication-ready figures.
`ggarrange()`
There are a variety of methods to mix multiple graphs on the same page; however, `ggplot2` does not work well with all of them. I am going to work with `ggpubr`, a package that builds on `gridExtra` (which allows us to arrange plots), works well with `ggplot2`, and allows us to align the axes of our plots. For a demonstration, we are going to take 3 plots that we made earlier (`phu_cases.plot`, `demographics.plot`, `phu_marginal.plot`) and then arrange and align them in the same figure. (http://www.sthda.com/english/rpkgs/ggpubr/)
Figure: Example plot arrangements that can be accomplished with the `ggpubr` package.
`ggarrange()` is a function that takes your plots, their labels, and how you would like your plots arranged in rows and columns. To start let's put our PHU case data (`phu_cases.plot`) above our PHU age group data (`demographics.plot`). If you picture each plot as a square in a grid, we need one column (since they are stacked, `ncol = 1`) and two rows (one for each plot, `nrow = 2`).
# Adjust our plot window size according to the expected output
options(repr.plot.width=20, repr.plot.height=20)
# Arrange the two plots in a single page
ggarrange(..., ...,
labels = c("A", "B"),
ncol = ..., nrow = ...)
Next we will add in the boxplot by nesting a ggarrange() call within another.
Imagine a square with 4 boxes.
To do this, we are arranging 2 rows (one with the line graph and one with the [age group + marginal plot], nrow = 2) and we are arranging 2 columns in the bottom row (one with the age group and one with the marginal plot, ncol = 2).
# Adjust our plot window size according to the expected output
options(repr.plot.width=20, repr.plot.height=20)
# Arrange the two plots in a single page
ggarrange(phu_cases.plot, # row 1 plot
# row 2 plots
ggarrange(..., ...,
labels = c("B", "C"),
ncol = 2,
nrow = 1
),
# finish specifying characteristics of the two-row arrangement
labels = c("A"),
ncol = 1,
nrow = 2
)
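The nesting pattern can be seen end-to-end with three small toy plots. The `mtcars` data and the object names below are assumptions for illustration; the inner `ggarrange()` call builds the bottom row and is then treated as a single plot by the outer call.

```r
library(ggplot2)
library(ggpubr)

# Three simple toy plots from the built-in mtcars data
top.plot   <- ggplot(mtcars) + aes(x = wt, y = mpg) + geom_point()
left.plot  <- ggplot(mtcars) + aes(x = factor(cyl), y = mpg) + geom_boxplot()
right.plot <- ggplot(mtcars) + aes(x = mpg) + geom_histogram(bins = 10)

nested.plot <-
  ggarrange(top.plot,                               # row 1: a single wide plot
            ggarrange(left.plot, right.plot,        # row 2: two plots side by side
                      labels = c("B", "C"),
                      ncol = 2, nrow = 1),
            labels = "A",
            ncol = 1, nrow = 2)

# Show the arranged figure
nested.plot
```

Only the top plot is labelled by the outer call because the bottom row already carries its own labels from the inner call.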
`align` and `font()`
Okay, there are a few problems with this arrangement.
Problem 1: Spacing aside, our title in plot B has spread over into area C. If you wanted to keep it, you would have to fix up the text in the plot and try again. However, we can treat the plots much like their own data and keep altering them with the `+` symbol. That means for a quick fix, we could just remove the title altogether. Do you remember how to access the plot title?
Problem 2: the x-axes in our B/C plots don't line up well. Would it look better if they did? If y-axis lines or x-axis lines are not aligned, this can be fixed by setting `align = "v"` or `align = "h"`.
Problem 3: the labels denoting each plot look a little small overall. We can change this aspect with the `font.label` parameter.
If you wanted to make sure all axis titles are the same size you can specify these small changes using `font()`. You can try to access these attributes through simple names like "axis.title" and "legend.title", i.e. `font("axis.title", size=9)`, but you need to set each graph and each attribute separately.
Let's drop our plot B title, and try to shore up the axes between B and C. Unfortunately we may be stopped by the crowded spacing at the bottom of these plots.
# Adjust our plot window size according to the expected output
options(repr.plot.width=20, repr.plot.height=20)
plot <-
# Arrange the two plots in a single page
ggarrange(phu_cases.plot,
ggarrange(demographics.plot + ..., ### 6.3.0 remove the title
phu_marginal.plot,
labels = c("B", "C"),
ncol = 2,
nrow = 1,
... = list(size=20), # make the labels larger
align = "h" # Try to align the x-axis of both plots
),
labels = c("A"),
ncol = 1,
nrow = 2,
... = list(size=20) # Match the increased label size of the other plots
)
plot
ggsave(filename = "Lec03.ggarange.png", plot = plot, width = 20, height = 20, units = "in")
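As a compact illustration of the alignment and font options, the sketch below arranges two toy plots side by side. The `mtcars` data and object names are assumptions for illustration; `font()` is applied to each plot separately, while `font.label` and `align` are set once in the `ggarrange()` call.

```r
library(ggplot2)
library(ggpubr)

# Two toy plots from the built-in mtcars data
left.plot  <- ggplot(mtcars) + aes(x = wt, y = mpg) + geom_point()
right.plot <- ggplot(mtcars) + aes(x = hp, y = mpg) + geom_point()

aligned.plot <-
  ggarrange(left.plot  + font("axis.title", size = 9),  # same axis title size...
            right.plot + font("axis.title", size = 9),  # ...set on each plot in turn
            labels = c("B", "C"),
            ncol = 2, nrow = 1,
            font.label = list(size = 20),  # enlarge the panel labels
            align = "h")                   # align the plots horizontally

# Show the arranged figure
aligned.plot
```

Whether the axes visibly line up also depends on the legends and titles attached to each plot, so some trial and error may still be needed.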
`ggpubr`
One last tool that you might find useful in your plots is the addition of significance levels or p-values to your plots. Since we've already loaded the `ggpubr` package, we'll use a layer for pair-wise comparisons called `geom_pwc()` which will allow us to perform a limited analysis of our data.
Before continuing, we should take a look at the `compare_means()` function to see how `ggpubr` performs its analyses. This function, like other modeling functions (e.g. think `lm()`), can accept a formula based on your variables from a specific set of data. In our case, we'd like to see how, within each age group, the percent cases compares to the percent hospitalizations.
The `compare_means()` function has a few relevant parameters to help us out:
- `formula`: the formula we use to define our dependent variable as a function of our independent variable(s)
- `data`: the data set you will be using
- `method`: the type of comparison you'd like to make, either comparing two groups directly (`t.test` or `wilcox.test`) or using an omnibus test (`anova` or `kruskal.test`)
- `ref.group`: a character string or numeric value denoting which group the other comparisons are to be made against (think in terms of a control group!)
- `group.by`: a character vector stating which additional variables you'd like to use in grouping your data. This is used for grouped plots!
- `p.adjust.method`: how you'd like to correct for multiple comparisons (e.g. bonferroni, hommel, hochberg, BH, etc.)
Let's try out the `compare_means()` function on our COVID-19 demographics data.
covid_demographics_total.df %>%
# Ungroup this dataframe to clean it up a little
ungroup() %>%
# Filter for cumulative data
filter(period == "cumulative") %>%
# Select for just the important columns
select(public_health_unit, age_group, percent_cases, percent_hospitalizations) %>%
# Pivot the modified table to capture the "stat_group" of percent_cases vs percent_hospitalizations
pivot_longer(cols=c(3,4), names_to = "stat_group", values_to = "percent_PHU_total") %>%
# Compare the means of our groups within the data
compare_means(formula = ...,
data = .,
group.by = ...,
p.adjust.method = "hochberg")
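To get a feel for the shape of the output, here is a small sketch of `compare_means()` on the built-in `ToothGrowth` data (an assumption for illustration; it is not our PHU data). Tooth length is compared between the two supplement types within each dose group.

```r
library(ggpubr)

# ToothGrowth ships with R: len (numeric), supp (2 levels), dose (3 values)
tooth_comparisons.df <-
  compare_means(formula = len ~ supp,      # dependent ~ independent
                data = ToothGrowth,
                method = "wilcox.test",    # non-parametric two-group test
                group.by = "dose",         # one comparison per dose
                p.adjust.method = "hochberg")

# One row per comparison with raw, adjusted, and symbol-coded p-values
tooth_comparisons.df
```

The Wilcoxon test may warn about ties in these data; that is a warning about exact p-values rather than an error.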
`geom_pwc()` to add significance levels to your plots
Now that we've seen how `compare_means()` generates output, we can use this knowledge to add pairwise comparison significance levels directly to our plots using the ggplot-friendly layer `geom_pwc()` which will essentially annotate our plot with the levels.
This function shares many of the same parameters as compare_means() with a few additions:
x, y and other factors.
- `mapping`: the same kind of mapping parameters as all other geom layers; this lets us set some aesthetics, most importantly the `group` aesthetic.
- `y.position`: the y-axis value at which we want to display our significance values. This can be a single value or a vector of values to represent each comparison.
- `method`: here the choice of methods differs and they come from the `rstatix` package, including `wilcox_test`, `t_test`, `dunn_test`, and `tukey_hsd`
- `method.args`: a list of additional arguments that are needed for the test method. For instance `tukey_hsd` will require a model object (e.g. `lm` or `aov`) to determine its comparisons.
- `label`: this determines the source of the labels for your plot. They can include `p.adj`, `p.format`, and `p.signif` as well as an expression using the syntax we have already learned.
There are many additional parameters generally for tweaking how the data is displayed. You can find a list of these over on the [ggpubr reference](https://rpkgs.datanovia.com/ggpubr/reference/geom_pwc.html)
Let's add the Wilcoxon comparisons from our above analysis directly to our grouped violin plots.
# Adjust our plot window size according to the expected output
options(repr.plot.width=20, repr.plot.height=12)
comparison_groups <- list(c("20-39", "40-59"), c("60-79", "80+"))
# comparison_groups <- list(c("percent_cases", "percent_hospitalizations"))
# Build and save the plot for later use
demographics.plot <- covid_demographics_total.df %>%
# Ungroup this dataframe to clean it up a little
ungroup() %>%
# Filter for cumulative data
filter(period == "cumulative") %>%
# Select for just the important columns
select(public_health_unit, age_group, percent_cases, percent_hospitalizations) %>%
# Pivot the modified table to capture the "stat_group" of percent_cases vs percent_hospitalizations
pivot_longer(cols=c(3,4), names_to = "stat_group", values_to = "percent_PHU_total") %>%
# filter(stat_group == "percent_cases") %>%
# Plot the data as a grouped violin plot with inset boxplot
# 1. Data
ggplot(.) +
# 2. Aesthetics
aes(x=age_group, y = percent_PHU_total) +
# Start with a base theme
theme_minimal() +
theme(text = element_text(size=20), # set text size to 20
# Move the legend around to within the panel space
legend.justification = c(0,1),
legend.position = c(0.02,0.95),
legend.direction = "horizontal",
# Update the panel to drop the minor axis grid lines
panel.grid.minor = element_blank(),
# Use a black line for the axes
axis.line = element_line(colour = "black"),
axis.text = element_text(colour = "black", face="bold"),
) +
# Add labels to the plot
labs(title = "Percent cases and hospitalizations by proportion per PHU across age group",
x = "\nAge group",
y = "Proportion of reported PHU data\n",
caption = "\n*Age group values are calculated as a percentage of total cases or hospitalizations within a PHU") +
### Use the guides() layer and get rid of the scale_fill_discrete() layer
guides(fill = "none", group = "none") +
# 3. Scaling
scale_y_continuous(limits = c(0, 0.6)) + # Set the limits of our y-axis
# Set the labels of our x-axis categories
scale_x_discrete(labels=c("0-4", "5-11", "12-19", "20-39", "40-59", "60-79", "80+")) +
# Set the colour legend
scale_colour_discrete(name = "Data category", labels = c("% cases", "% hospitalizations")) +
# 4. Data
# multi-factor violin plots but keep the width consistent
### 3.4.0 Link your fill to the colour aesthetic
geom_violin(scale="width",
aes(colour = stat_group, fill=after_scale(alpha(colour, 0.3))),
lwd = 1.5) +
# Boxplot but smaller width so they reside "within" the violin plot
geom_boxplot(aes(fill = stat_group), width=0.2,
position = position_dodge(width=0.9),
outlier.shape=NA) + # Remove the outliers
# Add in all of the data points
geom_quasirandom(dodge.width = 0.85, aes(group=stat_group), alpha = 0.8) +
### 6.4.1 Add in significance values to your plot
geom_pwc(mapping = ..., # Set the grouping to use stat_group (like group.by)
method = "wilcox_test", # Use a non-parametric test
label = ..., label.size = 10, # Label with significance levels instead of p-values
y.position = c(0.2, 0.2, 0.2, 0.45, 0.45, 0.5, 0.5)) # Reposition the y-axis location of individual labels
# Show the plot
demographics.plot
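The same layer can be tried on a much smaller grouped plot. This sketch assumes the built-in `ToothGrowth` data and hypothetical object names (`tooth.df`, `tooth_pwc.plot`). Within each dose, the two supplement groups are compared and annotated with significance symbols.

```r
library(ggplot2)
library(ggpubr)

# Copy the built-in data and treat dose as a categorical x-axis variable
tooth.df <- ToothGrowth
tooth.df$dose <- factor(tooth.df$dose)

tooth_pwc.plot <-
  ggplot(tooth.df) +
  aes(x = dose, y = len) +
  # Grouped boxplots: one box per supplement within each dose
  geom_boxplot(aes(fill = supp)) +
  # Compare the supplement groups within each dose and label with symbols
  geom_pwc(aes(group = supp),
           method = "wilcox_test",
           label = "p.signif")

# Show the plot
tooth_pwc.plot
```

Here the bracket heights are left to the layer's defaults; a `y.position` vector, as in the lecture plot, gives manual control when labels collide with the data.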
Today we have dug deep into altering and playing with our plots to help get them to that extra level. Although there is far more to explore, this should cover most of your needs when it comes to cleaning up your plots. To recap, we've looked at:
Looking a little bit ahead at this week's assignment, you will look at Canada-wide vaccination data.
You now have the tools to create plots like this:
Figure: Overall vaccination rates amongst provinces!
This week's assignment will be found under the current lecture folder under the "assignment" subfolder. It will include a Jupyter notebook that you will use to produce the code and answers for this week's assignment. Please provide answers in markdown or code cells that immediately follow each question section.
| Assignment breakdown | | |
|---|---|---|
| Code | 50% | - Does it follow best practices? |
| | | - Does it make good use of available packages? |
| | | - Was data prepared properly? |
| Answers and Output | 50% | - Is output based on the correct dataset? |
| | | - Are groupings appropriate? |
| | | - Are titles/axes/legends correct? |
| | | - Is interpretation of the graphs correct? |
Since coding styles and solutions can differ, students are encouraged to use best practices. Assignments may be rewarded for well-coded or elegant solutions.
You can save and download the Jupyter notebook in its native format. Submit this file to the appropriate assignment section by 9:59am on March 28th, 2023 (Yes, even though we don't have class that day).
Revision 1.0.0: created and prepared for CSB1021H S LEC0141, 03-2021 by Calvin Mok, Ph.D. Bioinformatician, Education and Outreach, CAGEF.
Revision 1.0.1: edited and prepared for CSB1020H S LEC0141, 03-2022 by Calvin Mok, Ph.D. Bioinformatician, Education and Outreach, CAGEF.
Revision 1.0.2: edited and prepared for CSB1020H S LEC0141, 03-2023 by Calvin Mok, Ph.D. Bioinformatician, Education and Outreach, CAGEF.
The R Graph Gallery: https://www.r-graph-gallery.com/index.html
Different aesthetics parameters in ggplot(): https://ggplot2.tidyverse.org/reference/aes_group_order.html
Which aesthetics can be altered for different geoms?: https://ggplot2.tidyverse.org/reference/aes_linetype_size_shape.html
Advanced examples of direct labeling with geom_dl(): https://directlabels.r-forge.r-project.org/examples.html
More information about the gghighlight package: https://cran.r-project.org/web/packages/gghighlight/vignettes/gghighlight.html
Using expression(): https://stat.ethz.ch/R-manual/R-devel/library/grDevices/html/plotmath.html
Using bquote(): https://www.r-bloggers.com/2018/03/math-notation-for-r-plot-titles-expression-and-bquote/
More options for ggarrange(): https://rpkgs.datanovia.com/ggpubr/reference/ggarrange.html
Learning some of the functions for ggExtra: https://cran.r-project.org/web/packages/ggExtra/vignettes/ggExtra.html
The Centre for the Analysis of Genome Evolution and Function (CAGEF) at the University of Toronto offers comprehensive experimental design, research, and analysis services in microbiome and metagenomic studies, genomics, proteomics, and bioinformatics.
From targeted DNA amplicon sequencing to transcriptomes, whole genomes, and metagenomes, from protein identification to post-translational modification, CAGEF has the tools and knowledge to support your research. Our state-of-the-art facility and experienced research staff provide a broad range of services, including both standard analyses and techniques developed by our team. In particular, we have special expertise in microbial, plant, and environmental systems.
For more information about us and the services we offer, please visit https://www.cagef.utoronto.ca/.
